History log of /freebsd-current/sys/ufs/ffs/ffs_vfsops.c
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 886fd36e 10-Aug-2023 Kirk McKusick <mckusick@FreeBSD.org>

Clean up and document UFS/FFS error returns.

The ffs_inotovp() function returns a vnode from a mounted filesystem
for an inode number with specified generation number. We now
consistently return ESTALE if the inode with given generation number
no longer exists on that filesystem.

The ffs_reload() function reloads all incore data for a filesystem.
It is used after running fsck on a mounted filesystem and finding
things to fix. It now returns the EINTEGRITY error if it is unable
to find a valid superblock.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 60a41168 10-Aug-2023 Chuck Silvers <chs@FreeBSD.org>

UFS: panic rather than forcibly unmount the root fs

If the root fs is forcibly unmounted then basically every process
will die with a SEGV as soon as it tries to run again because libc.so
is gone, which leaves the system basically hung. It seems better
to just panic instead, so let's do that.

Requested-by: karels
Reviewed-by: imp, mckusick, karels
Sponsored-by: Netflix
Differential Revision: https://reviews.freebsd.org/D41387


# 831b1ff7 27-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

UFS/FFS: Migrate to modern uintXX_t from u_intXX_t.

As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.

No functional changes intended.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 4d9b2ed3 22-Jul-2023 Mateusz Guzik <mjg@FreeBSD.org>

ufs: stop using LK_SLEEPFAIL in ffs_sync

It provides nothing as either locking succeeds or fails with ENOENT as
is.


# 1d9f3a37 06-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

Stop cleaning MNT_LOCAL on unmount

There is no point in clearing just this flag. Flags are reset on the
struct mount re-allocation for reuse anyway.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37966


# 829f0bcb 19-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the concept of vnode state transitions

To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>

As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).

Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).

The new field lands in a 1-byte hole, thus it does not grow the struct.

Bump __FreeBSD_version to 1400077

Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759


# ed1bb254 19-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

mntfs: change mntfs_allocvp API to relock on its own

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759


# 08e5f519 05-Dec-2022 Kirk McKusick <mckusick@FreeBSD.org>

Provide more precise mount(8) failure message.

Suggested by: Xin LI
Reviewed by: kib
PR: 19683
MFC after: 1 week


# 27d673fb 27-Sep-2022 Kirk McKusick <mckusick@FreeBSD.org>

When taking a snapshot on a UFS/FFS filesystem, it must be mounted.

The "update" mount option must be specified when the "snapshot"
mount option is used. Return EINVAL if the "snapshot" option is
specified without the "update" option also requested.

Reported by: Robert Morris
Reviewed by: kib
PR: 265362
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 6b9d4fbb 13-Aug-2022 Kirk McKusick <mckusick@FreeBSD.org>

Explicitly initialize rather than reading newly allocated UFS inodes.

The function ffs_vgetf() is used to find or load UFS inodes into a
vnode. It first looks up the inode and if found in the cache its
vnode is returned. If it is not already in the cache, a new vnode
is allocated and its associated inode read in from the disk. The
read is done even for inodes that are being initially created.
The contents for the inode on the disk are assumed to be empty. If
the on-disk contents had been corrupted either due to a hardware
glitch or an agent deliberately trying to exploit the system, the
UFS code could panic from the unexpected partially-allocated inode.

Rather then having fsck_ffs(8) verify that all unallocated inodes
are properly empty, it is easier and quicker to add a flag to
ffs_vgetf() to indicate that the request is for a newly allocated
inode. When set, the disk read is skipped and the inode is set to
its expected empty (zero'ed out) initial state. As a side benefit,
an unneeded disk I/O is avoided.

Reported by: Peter Holm
Sponsored by: The FreeBSD Foundation


# e6886616 13-Aug-2022 Kirk McKusick <mckusick@FreeBSD.org>

Move the ability to search for alternate UFS superblocks from fsck_ffs(8)
into ffs_sbsearch() to allow use by other parts of the system.

Historically only fsck_ffs(8), the UFS filesystem checker, had code
to track down and use alternate UFS superblocks. Since fsdb(8) used
much of the fsck_ffs(8) implementation it had some ability to track
down alternate superblocks.

This change extracts the code to track down alternate superblocks
from fsck_ffs(8) and puts it into a new function ffs_sbsearch() in
sys/ufs/ffs/ffs_subr.c. Like ffs_sbget() and ffs_sbput() also found
in ffs_subr.c, these functions can be used directly by the kernel
subsystems. Additionally they are exported to the UFS library,
libufs(8) so that they can be used by user-level programs. The new
functions added to libufs(8) are sbfind(3) that is an alternative
to sbread(3) and sbsearch(3) that is an alternative to sbget(3).
See their manual pages for further details.

The utilities that have been changed to search for superblocks are
dumpfs(8), fsdb(8), ffsinfo(8), and fsck_ffs(8). Also, the prtblknos(8)
tool found in tools/diag/prtblknos searches for superblocks.

The UFS specific mount code uses the superblock search interface
when mounting the root filesystem and when the administrator doing
a mount(8) command specifies the force flag (-f). The standalone UFS
boot code (found in stand/libsa/ufs.c) uses the superblock search
code in the hope of being able to get the system up and running so
that fsck_ffs(8) can be used to get the filesystem cleaned up.

The following utilities have not been changed to search for
superblocks: clri(8), tunefs(8), snapinfo(8), fstyp(8), quot(8),
dump(8), fsirand(8), growfs(8), quotacheck(8), gjournal(8), and
glabel(8). When these utilities fail, they do report the cause of
the failure. The one exception is the tasting code used to try and
figure what a given disk contains. The tasting code will remain
silent so as not to put out a slew of messages as it trying to taste
every new mass storage device that shows up.

Reviewed by: kib
Reviewed by: Warner Losh
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D36053
Sponsored by: The FreeBSD Foundation


# b21582ee 30-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Add a flags parameter to the ffs_sbget() function that reads UFS superblocks.

Rather than trying to shoehorn flags into the requested superblock
address, create a separate flags parameter to the ffs_sbget()
function in sys/ufs/ffs/ffs_subr.c. The ffs_sbget() function is
used both in the kernel and in user-level utilities through export
to the sbget() function in the libufs(3) library (see sbget(3)
for details). The kernel uses ffs_sbget() when mounting UFS
filesystems, in the glabel(8) and gjournal(8) GEOM utilities,
and in the standalone library used when booting the system
from a UFS root filesystem.

The ffs_sbget() function reads the superblock located at the byte
offset specified by its sblockloc parameter. The value UFS_STDSB
may be specified for sblockloc to request that the standard
location for the superblock be read.

The two existing options are now flags:

UFS_NOHASHFAIL will note if the check hash is wrong but will still
return the superblock. This is used by the bootstrap code to
give the system a chance to come up so that fsck can be run to
correct the problem.

UFS_NOMSG indicates that superblock inconsistency error messages
should not be printed. It is used by programs like fsck that
want to print their own error message and programs like glabel(8)
that just want to know if a UFS filesystem exists on a partition.

One additional flag is added:

UFS_NOCSUM causes only the superblock itself to be returned, but does
not read in any auxiliary data structures like the cylinder group
summary information. It is used by clients like glabel(8) that
just want to check for possible filesystem types. Using UFS_NOCSUM
skips the superblock checks for csum data which allows superblocks
that have corrupted csum data to be read and used.

The validate_sblock() function checks that the superblock has not
been corrupted in a way that can crash or hang the system. Unless
the UFS_NOMSG flag is specified, it will print out any errors that
it finds. Prior to this commit, validate_sblock() returned as soon
as it found an inconsistency so would print at most one message.
It now does all its checks so when UFS_NOMSG has not been specified
will print out everything that it finds inconsistent.

Sponsored by: The FreeBSD Foundation


# 064e6b43 13-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Rewrite function definitions in the UFS/FFS code base with identifier lists.

The K&R style in UFS and other places in the tree's days are numbered
as this syntax is removed in C2x proposal N2432:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf

Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.

Requested by: Warner Losh
MFC after: 2 weeks


# f1b4324b 22-Jun-2022 Chuck Silvers <chs@FreeBSD.org>

ffs: fix vn_read_from_obj() usage for PAGE_SIZE > block size

vn_read_from_obj() requires that all pages of a vnode (except the last
partial page) be either completely valid or completely invalid,
but for file systems with block size smaller than PAGE_SIZE,
partially valid pages may exist anywhere in the file.
Do not enable the vn_read_from_obj() path in this case.

Reviewed by: mckusick, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34836


# ca7c2d2e 09-May-2022 Konstantin Belousov <kib@FreeBSD.org>

UFS: clear fs_fmod once more, in the buffer data copy.

This is needed for in-kernel copy of the code, where allocation might
happen after fs_fmod is cleared in ffs_sbput() but before the write.

Reported by: markj
Reviewed by: chs, markj
PR: 263765
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35149


# 4ac2df8f 08-May-2022 Konstantin Belousov <kib@FreeBSD.org>

ffs_use_bwrite: make the superblock snapshot more consistent

Copy in-memory struct fs to the superblock buffer under the UFS mutex.

Reviewed by: chs, markj
PR: 263765
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35149


# 3dc5f8e1 08-Apr-2022 Chuck Silvers <chs@FreeBSD.org>

ffs: wait for trims earlier during unmount to avoid panic

All softdep processing is supposed to be completed by
softdep_flushfiles() and no more deps are supposed to be created after
that, but if a pending trim completes after softdep_flushfiles() and
before softdep_unmount() then the blkfree that is performed by
ffs_blkfree_trim_task() will create a dep when none should exist, and
if softdep_unmount() is called before that dep is freed then the
kernel will panic. Prevent this by waiting for trims to complete
earlier in the unmount process, in ffs_flushfiles(), so that any deps
will be freed and any modified CG buffers will be flushed by the final
fsync of the devvp in ffs_flushfiles() as intended.

Reviewed by: mckusick, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34806


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# ab2dbd9b 16-Mar-2022 Robert Wing <rew@FreeBSD.org>

ffs_mount(): fix snapshotting

Commit 0455cc7104ec broke snapshotting for ffs. In that commit,
ffs_mount() was changed so the namei() lookup for a disk device happens
before ffs_snapshot(). This caused the issue where namei() would lookup
the snapshot file and fail because the file doesn't exist. Even if it did
exist, taking a snapshot would still fail since it's not a disk device.

Fix this by taking a snapshot of the filesystem as-is and return without
altering ro/rw or any other attributes that are passed in.

Reported by: pho
Reviewed by: mckusick
Fixes: 0455cc7104ec ("ffs_mount(): return early if namei() fails to lookup disk device")
Differential Revision: https://reviews.freebsd.org/D34562


# 0455cc71 07-Mar-2022 Robert Wing <rew@FreeBSD.org>

ffs_mount(): return early if namei() fails to lookup disk device

With soft updates enabled, an INVARIANTS panic is hit in ffs_unmount().

The problem occurs in ffs_mount() when upgrading a mount from ro->rw.
During a mount update, the soft update code gets set up but doesn't get
cleaned up if namei() fails when looking up the disk device.

Avoid this scenario by looking up the disk device first and bail early
if the namei() lookup fails.

PR: 256511
MFC After: 2 weeks
Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D30870


# 303d3ae7 31-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

ufs, msdosfs: do not record witness order when creating vnode

When allocating new vnode, we need to lock it exclusively before
making it externally visible. Since other threads cannot observe the
vnode yet, current lock order cannot create LoR conditions.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34126


# 1fbcaa13 02-Jan-2022 Kirk McKusick <mckusick@FreeBSD.org>

When doing a read-only mount of a UFS filesystem using gjournal(8),
suppress error message about a missing gjournal provider.

Submitted by: Andreas Longwitz
MFC after: 2 weeks
Sponsored by: Netflix


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# c34a5148 16-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: fix newly introduced LOR between mntfs vnode lock and topology lock

The mntfs vnode lock should be before topology, as established in
ffs_mountfs(). Extend the locked region in ffs_unmount().

Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33013


# 9e9dcac9 15-Nov-2021 Kirk McKusick <mckusick@FreeBSD.org>

Allow forced r/w mount of UFS/FFS filesystem with a bad check hash.

Normally a UFS/FFS filesystem with a bad check hash can only be
mounted read only. With this commit the mount(8) -f (force) option
can be used to force a read-write mount of a UFS/FFS filesystem with
a bad check hash. Conveniently the filesystem will proceed to
update its on-disk superblock with a corrected check hash.

Sponsored by: Netflix


# 25809a01 01-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

mntfs: lock mntfs pseudo devfs vnode properly

Require devvp locked for mntfs_freevp(), to have it locked around
vgone(). Make that true for ffs, which is the only consumer of
the interface.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# 76b05e3e 01-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: Remove assertions about locked um_devvp in several places

Namely, ffs_blkfree_cg(), and ffs_flushfiles().

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# 2030ee0e 19-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

ufs: remove write-only variables

Mark variables as __diagused for invariant-only vars

Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577


# 9acea164 02-Oct-2021 Robert Wing <rew@FreeBSD.org>

ffs: retire unused fsckpid mount option

The fsckpid mount option was introduced in 927a12ae16433b50 along with a
couple sysctl's to support SU+J with snapshots. However, those sysctl's
were never used and eventually removed in f2620e9ceb3ede02.

There are no in-tree consumers of this mount option.

Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D32015


# 440320b6 04-Sep-2021 Robert Wing <rew@FreeBSD.org>

ffs: remove unused thread argument from ffs_reload()

MFC After: 1 week
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D31127


# 8df4bc48 05-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

ufs rename: ensure that the result of ufs_checkpath() is stable

ufs_rename() calls ufs_checkpath() to ensure that the target directory
is not a child of the source. If not, rename would create a loop.
For instance:
source->X1->X2->target
and if source moved under target, we get corrupted filesystem.
Suppose that we initially have
source->X1 .... and X2->target
where X1 is not on path from root to X2. Then ufs_checkpath() accepts
the inodes, but there is nothing preventing parallel rename of X2 to become
under X1, after checkpath finished.

Ensure stability of ufs_checkpath() result by taking a per-mount sx in
ufs_rename right before ufs_checkpath() and till the end.

Reviewed by: chs, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 211ec9b7 17-Jul-2021 Jason A. Harmening <jah@FreeBSD.org>

FFS: remove ffs_fsfail_task

Now that dounmount() supports a dedicated taskqueue, we can simply call
it with MNT_DEFERRED directly from the failing context. This also
avoids blocking taskqueue_thread with a potentially-expensive unmount
operation.

Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016


# c746ed72 12-Jun-2021 Jason A. Harmening <jah@FreeBSD.org>

Allow stacked filesystems to be recursively unmounted

In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016


# f784da88 17-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Move mnt_maxsymlinklen into appropriate fs mount data structures

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-Note: struct mount layout
Differential revision: https://reviews.freebsd.org/D30325


# 2af934cc 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Assert that um_softdep is NULL on free(ump), i.e. softdep_unmount() was called

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# f776c54c 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_mount: when remounting ro->rw and sbupdate failed, cleanup softdeps

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 7c7a6681 28-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: clear MNT_SOFTDEP earlier when remounting rw to ro

Suppose that we remount rw->ro and in parallel some reader tries to
instantiate a vnode, e.g. during lookup. Suppose that softdep_unmount()
already started, but we did not cleared the MNT_SOFTDEP flag yet.
Then ffs_vgetf() calls into softdep_load_inodeblock() which accessed
destroyed hashes and freed memory.

Set/clear fs_ronly simultaneously (WRT to files flush) with MNT_SOFTDEP.
It might be reasonable to move the change of fs_ronly to under MNT_ILOCK,
but no readers take it.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# fd97fa64 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Add FFSV_FORCEINODEDEP flag for ffs_vgetf()

It will be used to allow SU flush code to sync the volume while external
consumers see that SU is already disabled on the filesystem. Use it where
ffs_vgetf() called by SU code to process dependencies.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 2bfd8992 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vnode: move write cluster support data to inodes.

The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679


# 89fd61d9 28-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Merge ufs_fhtovp() into ffs_inotovp().

The function alone was not used for anything but ffs_fstovp() for long time.

Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 5952c86c 26-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_inotovp(): interface to convert (ino, gen) into alive vnode

It generalizes the VFS_FHTOVP() interface, making it possible to fetch
the inode without faking filehandle. Also it adds the ffs flags argument
which allows to control ffs_vgetf() call.

Requested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# f16c26b1 26-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: Add FFSV_REPLACE_DOOMED flag to ffs_vgetf()

It specifies that caller requests a fresh non-doomed vnode. If doomed
vnode is found in the hash, it should behave similarly to FFSV_REPLACE.
Or, to put it differently, the flag is same as FFSV_REPLACE, but only
when the found hashed vnode is doomed.

Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# bf0db193 29-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

buf SU hooks: track buf_start() calls with B_IOSTARTED flag

and only call buf_complete() if previously started. Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by: pho
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundations


# a63eae65 30-Jan-2021 Kirk McKusick <mckusick@FreeBSD.org>

Revert 2d4422e7991a, Eliminate lock order reversal in UFS ffs_unmount().

After discussion with Chuck Silvers (chs@) we have decided that
there is a better way to resolve this lock order reversal which
will be committed separately.

Sponsored by: Netflix


# 79a5c790 15-Jan-2021 Kirk McKusick <mckusick@FreeBSD.org>

Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix


# 2d4422e7 11-Jan-2021 Kirk McKusick <mckusick@FreeBSD.org>

Eliminate lock order reversal in UFS ffs_unmount().

UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.

The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.

Sponsored by: Netflix


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225


# 8a1509e4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Handle LoR in flush_pagedep_deps().

When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock. Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed. This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps(). The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed. All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP. For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied. Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes. [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged. A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive(). In this case, dropping the
vnode lock is not safe. Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT. ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 61846fc4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a framework that tracks exclusive vnode lock generation count for UFS.

This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored. Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents. If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter. Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock. It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 996d40f9 24-Oct-2020 Kirk McKusick <mckusick@FreeBSD.org>

Various new check-hash checks have been added to the UFS filesystem
over various major releases. Superblock check hashes were added for
the 12 release and cylinder-group and inode check hashes will appear
in the 13 release.

When a disk with a UFS filesystem is writably mounted, the kernel
clears the feature flags for anything that it does not support. For
example, if a UFS disk from a 12-stable kernel is mounted on an
11-stable system, the 11-stable kernel will clear the flag in the
filesystem superblock that indicates that superblock check-hashs
are being maintained. Thus if the disk is later moved back to a
12-stable system, the 12-stable system will know to ignore its
incorrect check-hash.

If the only filesystem modification done on the earlier kernel is
to run a utility such as growfs(8) that modifies the superblock but
neither updates the check-hash nor clears the feature flag indicating
that it does not support the check-hash, the disk will fail to mount
if it is moved back to its original newer kernel.

This patch moves the code that clears the filesystem feature flags
from the mount code (ffs_mountfs()) to the code that reads the
superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
code and is imported into libufs(3), all the filesystem utilities
will now also clear these flags when they make modifications to the
filesystem.

As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
and repair bad superblock check-hashes rather than refusing to run.
This change allows fsck to recover filesystems that have been impacted
by utilities older than those created after this change and is a
sensible thing to do in any event.

Reported by: John Baldwin (jhb@)
MFC after: 2 weeks
Sponsored by: Netflix


# e1ef4c29 08-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not leak B_BARRIER.

Normally when a buffer with B_BARRIER is written, the flag is cleared
by g_vfs_strategy() when creating bio. But in some cases FFS buffer
might not reach g_vfs_strategy(), for instance when copy-on-write
reports an error like ENOSPC. In this case buffer is returned to
dirty queue and might be written later by other means. Among then
bdwrite() reasonably asserts that B_BARRIER is not set.

In fact, the only current use of B_BARRIER is for lazy inode block
initialization, where write of the new inode block is fenced against
cylinder group write to mark inode as used. The situation could be
seen that we break dependency by updating cg without written out
inode. Practically since CoW was not able to find space for a copy of
inode block, for the same reason cg group block write should fail.

Reported by: pho
Discussed with: chs, imp, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26511


# d90f2c36 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: clean up empty lines in .c and .h files


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# 03337743 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: clean MNTK_FPLOOKUP if MNT_UNION is set

Elides checking it during lookup.


# 9d5a594f 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add support for lockless lookup

ACLs are not supported, meaning their presence will force the use of the old lookup.

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25579


# 93440bbe 18-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

The binary representation of the superblock (the fs structure) is written
out verbatim to the disk: see ffs_sbput() in sys/ufs/ffs/ffs_subr.c.
It contains a pointer to the fs_summary_info structure. This pointer
value inadvertently causes garbage to be stored. It is garbage because
the pointer to the fs_summary_info structure is the address the then
current stack or heap. Although a mere pointer does not reveal anything
useful (like a part of a private key) to an attacker, garbage output
deteriorates reproducibility.

This commit zeros out the pointer to the fs_summary_info structure
before writing the out the superblock.

Reviewed by: kib
Tested by: Peter Holm
PR: 246983
Sponsored by: Netflix


# 34816cb9 18-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Move the pointers stored in the superblock into a separate
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.

This cleanup of the superblock eases the task of the commit
that immediately follows this one.

Suggested by: brooks
Reviewed by: kib
PR: 246983
Sponsored by: Netflix


# d9a8abf6 17-Jun-2020 Chuck Silvers <chs@FreeBSD.org>

Move all of the functions in ffs_subr.c that are only used by the ufs kernel
module from that file into ffs_vfsops.c. This fixes the build for kernel
configs that don't include FFS.

PR: 247256
Submitted by: glebius
Reviewed by: mckusick (earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25285


# 1f7104d7 13-Jun-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix export_args ex_flags field so that is 64bits, the same as mnt_flags.

Since mnt_flags was upgraded to 64bits there has been a quirk in
"struct export_args", since it hold a copy of mnt_flags
in ex_flags, which is an "int" (32bits).
This happens to currently work, since all the flag bits used in ex_flags are
defined in the low order 32bits. However, new export flags cannot be defined.
Also, ex_anon is a "struct xucred", which limits it to 16 additional groups.
This patch revises "struct export_args" to make ex_flags 64bits and replaces
ex_anon with ex_uid, ex_ngroups and ex_groups (which points to a
groups list, so it can be malloc'd up to NGROUPS in size.
This requires that the VFS_CHECKEXP() arguments change, so I also modified the
last "secflavors" argument to be an array pointer, so that the
secflavors could be copied in VFS_CHECKEXP() while the export entry is locked.
(Without this patch VFS_CHECKEXP() returns a pointer to the secflavors
array and then it is used after being unlocked, which is potentially
a problem if the exports entry is changed.
In practice this does not occur when mountd is run with "-S",
but I think it is worth fixing.)

This patch also deleted the vfs_oexport_conv() function, since
do_mount_update() does the conversion, as required by the old vfs_cmount()
calls.

Reviewed by: kib, freqlabs
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D25088


# d79ff54b 25-May-2020 Chuck Silvers <chs@FreeBSD.org>

This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.

- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088


# 71f26429 09-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

ufs: apply suspension for non-forced rw unmounts.

Forced rw unmounts and remounts from rw to ro already suspend
filesystem, which closes races with writers instantiating new vnodes
while unmount flushes the queue. Original intent of not including
non-forced unmounts into this regime was to allow such unmounts to
fail if writer was active, but this did not worked well.

Similar change, but causing all unmount, even involving only ro
filesystem, were proposed in D24088, but I believe that suspending ro
is undesirable, and definitely spends CPU time.

Reported by: markj
Discussed with: chs, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f15ccf88 06-Mar-2020 Chuck Silvers <chs@FreeBSD.org>

Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by: kib, mckusick
Approved by: imp (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23787


# 13532153 16-Feb-2020 Scott Long <scottl@FreeBSD.org>

Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command. Add complimentary support to the CAM periphs that
support it. This is a redo of r357710.


# 85eb41f7 10-Feb-2020 Scott Long <scottl@FreeBSD.org>

Revert r357710 and 357711 until they can be debugged


# 7d99bda7 09-Feb-2020 Scott Long <scottl@FreeBSD.org>

Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command. Add complimentary support to the CAM periphs that
support it.


# 6c44a3e0 25-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add vgone calls for unconstructed vnodes in the error path

This mostly eliminates the requirement that vput never unlocks the vnode
before calling VOP_INACTIVE. Note it may still be present for other
filesystems.

See r356126 for an example bug.

Note vput stopped doing early unlock in r357070 thus this change does
not affect correctness as it is.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23215


# 0297c138 14-Jan-2020 Kirk McKusick <mckusick@FreeBSD.org>

When sync'ing a mount point, the mount point's vnodes were scanned
twice. Once to update the changed inodes, and a second time to update
changed quota information. This change merges these two scans into a
single scan which does both inode and quota updates.

MFC after: 7 days


# 80663cad 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: use lazy list instead of active list for syncer

Quota code is temporarily regressed to do a full vnode scan.

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22996


# ac4ec141 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add a setter for inode i_flag field

This will be used later to add vnodes to the lazy list.

Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# e35cd9e3 06-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add root vnode caching

See r353150.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646


# 44d37182 03-Oct-2019 Kirk McKusick <mckusick@FreeBSD.org>

Update ffs_getcg() function to accept a flags parameter to be passed
to breadn_flags() in preparation for later need when doing forcible
unmount when disk dies or is removed.

No functional change.

Sponsored by: Netflix


# f3cf6225 06-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

ufs: Remove redundant brelse() after r294954

Same automation.

No functional change.


# 16040222 29-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

UFS: stop reusing the vnode for reallocated inode.

In ffs_valloc(), force reclaim existing vnode on inode reuse, instead
of trying to re-initialize the same vnode for new purposes. This is
done in preparation of changes to the vp->v_object lifecycle handling.

A new FFSV_REPLACE flag to ffs_vgetf() directs the function to
vgone(9) the vnode if found in vfs hash, instead of returning it.

Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412


# 9454b4fd 06-Aug-2019 Kirk McKusick <mckusick@FreeBSD.org>

A race condition existed between the time a UFS/FFS superblock check
hash was computed and the time that the superblock was copied to a
buffer to be written to disk. The result was a failed superblock
check hash the next time that the superblock was read.

The fix is to compute the check hash after the superblock has been
copied to a buffer to be written.

PR: 236504
Reported by: Peter Holm
Tested by: Peter Holm
Sponsored by: Netflix


# fdf34aa3 17-Jul-2019 Kirk McKusick <mckusick@FreeBSD.org>

The error reported in FS-14-UFS-3 can only happen on UFS/FFS
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.

A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.

This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.

Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by: kib
Sponsored by: Netflix


# daba4da8 01-Jul-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add a new "untrusted" option to the mount command. Its purpose
is to notify the kernel that the file system is untrusted and it
should use more extensive checks on the file-system's metadata
before using it. This option is intended to be used when mounting
file systems from untrusted media such as USB memory sticks or other
externally-provided media.

It will initially be used by the UFS/FFS file system, but should
likely be expanded to be used by other file systems that may appear
on external media like msdosfs, exfat, and ext2fs.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20786


# f89d2072 17-Jun-2019 Xin LI <delphij@FreeBSD.org>

Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 5ffc99e2 08-Apr-2019 Konstantin Belousov <kib@FreeBSD.org>

Handle races when remounting UFS volume from ro to rw.

In particular, ensure that writers are not unleashed before SU
structures are initialized. Also, correctly handle MNT_ASYNC before
this.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 72d28f97 15-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Reorder ffs_verify_dinode_ckhash() so that it checks the inode check-hash
before copying in the inode so that the mode and link-count are not set
if the check-hash fails. This change ensures that the vnode will be properly
unwound and recycled rather than being held in the cache.

Initialize the file mode is zero so that if the loading of the inode
fails (for example because of a check-hash failure), the vnode will be
properly unwound and recycled.

Reported by: Gary Jennejohn (gj)
Sponsored by: Netflix


# 8f829a5c 11-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# fb14e73c 05-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Normally when an attempt is made to mount a UFS/FFS filesystem whose
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.

This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.

Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.

Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.

Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix


# a02bd3e3 25-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

Move the check for the filesystem having been run on a kernel that
predates metadata check hashes so that it is done before deciding
whether to compute a check-hash of the superblock.

Reported by: Rick Macklem <rmacklem@uoguelph.ca>
Sponsored by: Netflix


# 9fc5d538 13-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# ec888383 23-Oct-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to the superblock. If a check hash fails when an attempt
is made to mount a filesystem, the mount fails with EINVAL (Invalid
argument). This avoids a class of filesystem panics related to
corrupted superblocks. The hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 7e038bc2 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix


# cc91864c 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.


# 68c49bcc 06-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix


# ab0bcb60 29-Jun-2018 Kirk McKusick <mckusick@FreeBSD.org>

Create um_flags in the ufsmount structure to hold flags for a UFS filesystem.
Convert integer structure flags to use um_flags:

int um_candelete; /* devvp supports TRIM */
int um_writesuspended; /* suspension in progress */

become:

#define UM_CANDELETE 0x00000001 /* devvp supports TRIM */
#define UM_WRITESUSPENDED 0x00000002 /* suspension in progress */

This is in preparation for adding other flags to indicate forcible
unmount in progress after a disk failure and possibly forcible
downgrade to read-only.

No functional change intended.

Sponsored by: Netflix


# efbf3964 01-Mar-2018 Kirk McKusick <mckusick@FreeBSD.org>

This change is some refactoring of Mark Johnston's changes in r329375
to fix the memory leak that I introduced in r328426. Instead of
trying to clear up the possible memory leak in all the clients, I
ensure that it gets cleaned up in the source (e.g., ffs_sbget ensures
that memory is always freed if it returns an error).

The original change in r328426 was a bit sparse in its description.
So I am expanding on its description here (thanks cem@ and rgrimes@
for your encouragement for my longer commit messages).

In preparation for adding check hashing to superblocks, r328426 is
a refactoring of the code to get the reading/writing of the superblock
into one place. Unlike the cylinder group reading/writing which
ends up in two places (ffs_getcg/ffs_geom_strategy in the kernel
and cgget/cgput in libufs), I have the core superblock functions
just in the kernel (ffs_sbfetch/ffs_sbput in ffs_subr.c which is
already imported into utilities like fsck_ffs as well as libufs to
implement sbget/sbput). The ffs_sbfetch and ffs_sbput functions
take a function pointer to do the actual I/O for which there are
four variants:

ffs_use_bread / ffs_use_bwrite for the in-kernel filesystem

g_use_g_read_data / g_use_g_write_data for kernel geom clients

ufs_use_sa_read for the standalone code (stand/libsa/ufs.c
but not stand/libsa/ufsread.c which is size constrained)

use_pread / use_pwrite for libufs

Uses of these interfaces are in the UFS filesystem, geoms journal &
label, libsa changes, and libufs. They also permeate out into the
filesystem utilities fsck_ffs, newfs, growfs, clri, dump, quotacheck,
fsirand, fstyp, and quot. Some of these utilities should probably be
converted to directly use libufs (like dumpfs was for example), but
there does not seem to be much win in doing so.

Tested by: Peter Holm (pho@)


# 16759360 16-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Fix a memory leak introduced in r328426.

ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.

Reported and tested by: pho
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D14390


# 068beacf 08-Feb-2018 Kirk McKusick <mckusick@FreeBSD.org>

The goal of this change is to prevent accidental foot shooting by
folks running filesystems created on check-hash enabled kernels
(which I will call "new") on a non-check-hash enabled kernels (which
I will call "old). The idea here is to detect when a filesystem is
run on an old kernel and flag the filesystem so that when it gets
moved back to a new kernel, it will not start getting a slew of
check-hash errors.

Back when the UFS version 2 filesystem was created, it added a file
flag FS_INDEXDIRS that was to be set on any filesystem that kept
some sort of on-disk indexing for directories. The idea was precisely
to solve the issue we have today. Specifically that a newer kernel
that supported indexing would be able to tell that the filesystem
had been run on an older non-indexing kernel and that the indexes
should not be used until they had been rebuilt. Since we have never
implemented on-disk directory indicies, the FS_INDEXDIRS flag is
cleared every time any UFS version 2 filesystem ever created is
mounted for writing.

This commit repurposes the FS_INDEXDIRS flag as the FS_METACKHASH
flag. Thus, the FS_METACKHASH is definitively known to have always
been cleared. The FS_INDEXDIRS flag has been moved to a new block
of flags that will always be cleared starting with this commit
(until they get used to implement some future feature which needs
to detect that the filesystem was mounted on a kernel that predates
the new feature).

If a filesystem with check-hashes enabled is mounted on an old
kernel the FS_METACKHASH flag is cleared. When that filesystem is
mounted on a new kernel it will see that the FS_METACKHASH has been
cleared and clears all of the fs_metackhash flags. To get them
re-enabled the user must run fsck (in interactive mode without the
-y flag) which will ask for each supported check hash whether it
should be rebuilt and enabled. When fsck is run in its default preen
mode, it will just ignore the check hashes so they will remain
disabled.

The kernel has always disabled any check hash functions that it
does not support, so as more types of check hashes are added, we
will get a non-surprising result. Specifically if filesystems get
moved to kernels supporting fewer of the check hashes, those that
are not supported will be disabled. If the filesystem is moved back
to a kernel with more of the check-hashes available and fsck is run
interactively to rebuild them, then their checking will resume.
Otherwise just the smaller subset will be checked.

A side effect of this commit is that filesystems running with
cylinder-group check hashes will stop having them checked until
fsck is run to re-enable them (since none of them currently have
the FS_METACKHASH flag set). So, if you want check hashes enabled
on your filesystems after booting a kernel with these changes, you
need to run fsck to enable them. Any newly created filesystems will
have check hashes enabled. If in doubt as to whether you have check
hashes emabled, run dumpfs and look at the list of enabled flags
at the end of the superblock details.


# 47806d1b 05-Feb-2018 Kirk McKusick <mckusick@FreeBSD.org>

Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
never mark it dirty, thus do not write it out, and hence never
update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
the old check hash is still in the VM cache, but some other pages
are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

- bufdone()
-- b_ckhashcalc (calculates check hash)
-- bufdone_finish()
--- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
in memory (but not all pages, or we wouldn't be doing the read),
the I/O is done with bogus_page mapped in for the pages that exist
in the VM cache. This mapping is done to avoid corrupting the
cached pages if there is any I/O overrun. The vfs_vmio_iodone()
function is responsible for replacing the bogus_page(s) with the
cached ones. But we were calculating the check hash before the
bogus_page(s) were replaced. Hence, when we were calculating the
check hash, we were partly reading from bogus_page, which means
we calculated a bad check hash (e.g., because multiple pages have
been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner


# dffce215 25-Jan-2018 Kirk McKusick <mckusick@FreeBSD.org>

Refactoring of reading and writing of the UFS/FFS superblock.
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.

No functional change intended.

Reviewed by: kib


# 377f88fb 09-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Postpone the disassotiation of the background write buffer with devvp
so that buf_complete() sees fully constructed buffer.

This is a NOP right now, but will be needed by the forthcoming SU change.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 75e3597a 21-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.

sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)

sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)

sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)


# 9c4f551e 28-Jun-2017 Kirk McKusick <mckusick@FreeBSD.org>

Create a new function ffs_getcg() to read in and verify a cylinder
group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.

In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS

Reviewed by: kib


# 4cbc378c 03-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Clean possible td_su reference on the struct mount being unmounted as
the last step of ffs_unmount().

It is possible that the mount point is recorded for cleanup in AST
context while softdep flush is executed during unmount. The workitems
are flushed by other means for the unmount, but the stray reference to
struct mount blocks destruction of mount. Check for the situation and
manually call vfs_rel() before returning from ffs_unmount().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 9ed01c32 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.


# a96da1c3 04-Apr-2017 Conrad Meyer <cem@FreeBSD.org>

ufs: Export UFS_MAXNAMLEN to pathconf, statfs

Rather than the global NAME_MAX constant. This change is required to
support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN.

This was missed in r313475 due to the alternative spelling ("NAME_MAX") of
MAXNAMLEN. This change is also similar in spirit to r313780.

Reported by: ngie@
Sponsored by: Dell EMC Isilon


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 1dc349ab 15-Feb-2017 Ed Maste <emaste@FreeBSD.org>

prefix UFS symbols with UFS_ to reduce namespace pollution

Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by: kib, mckusick
Obtained from: NetBSD
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9536


# 714b7df5 13-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Provide simple mutual exclusion between mount point update and unmount.

Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected. Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# ad544726 28-Oct-2016 Kirk McKusick <mckusick@FreeBSD.org>

Avoid possible overflow when calclating malloc size for auxillary
data structure sizes when mounting and reloading UFS/FFS
filesystems by using a u_long rather than an int for the size.

Reported by: Mariusz Zaborski <oshogbo@>
MFC after: 1 week


# 8660b707 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.

Reviewed by: kib


# e1db6897 17-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Reduce size of ufs inode.

Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2. Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# df426577 07-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Partially lift suspension when ffs_reload() finished with cgs and
going to re-read inodes.

Secondary write initiators, e.g. ufs_inactive(), might need to start a
write while owning the vnode lock. Since the suspended state
established by /dev/ufssuspend prevents them from entering
vn_start_secondary_write(), we get deadlock otherwise.

Note that it is arguably not very useful to re-read inodes after
/dev/ufssuspend suspension, because the suspension does not block
readers, and other threads might read existing files in parallel with
suspension owner (for now, only growfs(8)) operations. This
effectively means that suspension owner cannot safely modify existing
inodes, and then there is no sense in re-reading. But keep the code
enabled for now.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 57d2ac2f 22-May-2016 Kevin Lo <kevlo@FreeBSD.org>

arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from delphij, mckusick, rmacklem

Reviewed by: mckusick


# 5f36cc5b 21-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Stop dropping and reacquiring Giant around geom calls in UFS.

Sponsored by: The FreeBSD Foundation


# c70b3cd2 21-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes. Treat the field as a semaphore protecting availability of
the device for mounting. Do no access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# bfe89a8e 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only. Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by: bde
MFC after: 1 week


# 98cbffd7 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix comments.

Submitted by: bde
MFC after: 1 week


# 08e94183 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

UFS: spelling fixes on comments.

No functional change.


# c79dff0f 27-Mar-2016 Konstantin Belousov <kib@FreeBSD.org>

Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes. Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by: Nick Evans <nevans@talkpoint.com>
Tested by: Nick Evans <nevans@talkpoint.com>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# abe53f7e 27-Jan-2016 Kirk McKusick <mckusick@FreeBSD.org>

This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib


# 43a993bb 29-Nov-2015 Kirk McKusick <mckusick@FreeBSD.org>

For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by: Peter Holm


# fade8dd7 23-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.

In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division


# 5f34e93c 05-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D2937


# 521987c3 29-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Simplify code, no need to test the flag before clearing it.

Submitted by: ed
MFC after: 12 days


# b2c3df84 27-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Handle errors from background write of the cylinder group blocks.

First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks


# 1eabd967 16-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active. This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list. Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation. In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list. When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 69baeadc 29-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros. It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review: https://reviews.freebsd.org/D2665
Reviewed by: rodrigc
Looked at by: bjk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 74a87c38 24-Apr-2015 Kirk McKusick <mckusick@FreeBSD.org>

Limit the number of cylinder groups that will be searched when
trying to build a cluster. The limit is tunable using the sysctl
vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups
per block allocation. It was previously limited to the number of
cylinder groups in the filesystem per block allocation. When there
were no clusters of the needed size left, it repeatedly searched
the whole filesystem for a non-existent cluster on every block
allocation. The result was very slow filesystem allocation with
100% CPU utilization. The old behavior can be had by setting
vfs.ffs.maxclustersearch to a huge number (1,000,000).

This change affects only the layout policy routines so is not able
to interfere with the integrity of the filesystem.

Reported by: Dmitry Sivachenko (demon@)
Tested by: Dmitry Sivachenko (demon@)
MFC after: 2 weeks


# dda11d4a 15-Apr-2015 Rick Macklem <rmacklem@FreeBSD.org>

File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by: kib
MFC after: 2 weeks


# 4af9f77e 27-Mar-2015 Konstantin Belousov <kib@FreeBSD.org>

Fix the hand after the immediate reboot when the following command
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount. There are two issues:

1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0. FFS is suspended during
unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually. It is not
linked until the superblock is written. But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed. It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock. Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and substract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is third problem, occuring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons. Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.

PR: 195458
In collaboration with: gjb, pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4fce16e4 20-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after: 2 weeks


# 4ce90426 20-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Correct the test for condition to suspend UFS filesystem during
unmount. There is no need to suspend read-only filesystem, while we
need suspension on modificable mount point.

Reported by: rwatson
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 895b3782 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e9838c11 03-Jun-2014 John-Mark Gurney <jmg@FreeBSD.org>

don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags..
If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...

makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...

Reviewed by: silience from mckusick
MFC after: 1 week


# 4896af9f 01-Mar-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after: 3 days
Reviewed by: mckusick


# cf058082 21-Oct-2013 Brooks Davis <brooks@FreeBSD.org>

Allow kernels without options SOFTUPDATES to build. This should fix the
embedded tinderboxes.

Reviewed by: emaste


# 519e3c3b 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# cc3d8c35 09-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 5f3a9c40 02-Jul-2013 Pedro F. Giffuni <pfg@FreeBSD.org>

Style fix: spaces.

Cleanup the incomplete revert.

Reported by: bde
MFC after: 4 weeks


# 9a0aea46 01-Jul-2013 Pedro F. Giffuni <pfg@FreeBSD.org>

Change i_gen in UFS to an unsigned type.

Revert the simplification of the i_gen calculation.
It is still a good idea to avoid zero values and for the case
of old filesystems there is probably no advantage in using
the complete 32 bits anyways.

Discussed with: bde
MFC after: 4 weeks


# bcb2f550 01-Jul-2013 Pedro F. Giffuni <pfg@FreeBSD.org>

Change i_gen in UFS to an unsigned type.

Further simplify the i_gen calculation for older disks.
Having a zero here is not really a problem and this is more
similar to what is done in newfs_random().

Reported by: Xin Li
MFC after: 4 weeks


# eee4072f 30-Jun-2013 Pedro F. Giffuni <pfg@FreeBSD.org>

Change i_gen in UFS to an unsigned type.

In UFS, i_gen is a random generated value and there is not way for
it to be negative. Actually, the value of i_gen is just used to
match bit patterns and it is of not consequence if the values are
signed or not.

Following other filesystems, set it to unsigned and use it as such,

Discussed by: mckusick
Reviewed by: mckusick (previous version)
MFC after: 4 weeks


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# 26089666 06-Apr-2013 Jeff Roberson <jeff@FreeBSD.org>

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


# 59a01b70 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

UFS support of the unmapped i/o for the user data buffers.

Sponsored by: The FreeBSD Foundation
Tested by: pho, scottl, jhb, bf


# ba05dec5 27-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered. The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration. At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


# ddd6b3fc 10-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# c6e0355c 19-Nov-2012 Attilio Rao <attilio@FreeBSD.org>

r16312 is not any longer real since many years (likely since when VFS
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.

Removes comments that makes no sense now.

Reviewed by: kib
MFC after: 3 days


# 1848286a 18-Nov-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add UFS writesuspension mechanism, designed to allow userland processes
to modify on-disk metadata for filesystems mounted for write.

Reviewed by: kib, mckusick
Sponsored by: FreeBSD Foundation


# bc2258da 09-Nov-2012 Attilio Rao <attilio@FreeBSD.org>

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


# f7a3729c 22-Jul-2012 Kevin Lo <kevlo@FreeBSD.org>

Use NULL instead of 0 for pointers


# b569050a 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Enable vn_io_fault() lock avoidance for UFS.

Tested by: pho
MFC after: 2 months


# 72b8ff1c 21-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix use-after-free introduced in r234036.

Reviewed by: mckusick
Tested by: pho


# dca5e0ec 20-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 71469bb3 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 2b028c25 08-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix panic in ffs_reload(), which may happen when read-only filesystem
gets resized and then reloaded.

Reviewed by: kib, mckusick (earlier version)
Sponsored by: The FreeBSD Foundation


# b73ffa31 08-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Drop an unnecessary setting of si_mountpt when updating a UFS mount point.
Clearly it must have been set when the mount was done.

Reviewed by: kib


# 1faacf5d 28-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib


# ea573a50 28-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Do trivial reformatting of the comment to record the missed commit
message for r233609:
Restore the writes of atimes, quotas and superblock from syncer vnode.

Noted by: rdivacky


# a988a5c6 28-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Reviewed by: bde, mckusick
Tested by: pho
MFC after: 2 weeks


# e0c17408 28-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Update comment.

MFC after: 3 days


# 75a58389 24-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


# 19c87af0 07-Feb-2012 Kirk McKusick <mckusick@FreeBSD.org>

In the original days of BSD, a sync was issued on every filesystem
every 30 seconds. This spike in I/O caused the system to pause every
30 seconds which was quite annoying. So, the way that sync worked
was changed so that when a vnode was first dirtied, it was put on
a 30-second cleaning queue (see the syncer_workitem_pending queues
in kern/vfs_subr.c). If the file has not been written or deleted
after 30 seconds, the syncer pushes it out. As the syncer runs once
per second, dirty files are trickled out slowly over the 30-second
period instead of all at once by a call to sync(2).

The one drawback to this is that it does not cover the filesystem
metadata. To handle the metadata, vfs_allocate_syncvnode() is called
to create a "filesystem syncer vnode" at mount time which cycles
around the cleaning queue being sync'ed every 30 seconds. In the
original design, the only things it would sync for UFS were the
filesystem metadata: inode blocks, cylinder group bitmaps, and the
superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from
which the filesystem is mounted).

Somewhere in its path to integration with FreeBSD the flushing of
the filesystem syncer vnode got changed to sync every vnode associated
with the filesystem. The result of this change is to return to the
old filesystem-wide flush every 30-seconds behavior and makes the
whole 30-second delay per vnode useless.

This change goes back to the originally intended trickle out sync
behavior. Key to ensuring that all the intended semantics are
preserved (e.g., that all inode updates get flushed within a bounded
period of time) is that all inode modifications get pushed to their
corresponding inode blocks so that the metadata flush by the
filesystem syncer vnode gets them to the disk in a timely way.
Thanks to Konstantin Belousov (kib@) for doing the audit and commit
-r231122 which ensures that all of these updates are being made.

Reviewed by: kib
Tested by: scottl
MFC after: 2 weeks


# cc672d35 16-Jan-2012 Kirk McKusick <mckusick@FreeBSD.org>

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


# b60ee81e 14-Jan-2012 Kirk McKusick <mckusick@FreeBSD.org>

Convert FFS mount error messages from kernel printf's to using the
vfs_mount_error error message facility provided by the nmount
interface.

Clean up formatting of mount warnings which still need to use
kernel printf's since they do not return errors.

Requested by: Craig Rodrigues <rodrigc@crodrigues.org>
MFC after: 2 weeks


# fddf7bae 29-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)


# d716efa9 24-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


# 927a12ae 15-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.


# 8795189c 09-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Allow disk partitions associated with UFS read-only mounted
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.

Discussed with: Jeff Roberson


# 9420dc62 12-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.


# 280e091a 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 694a586a 21-May-2011 Rick Macklem <rmacklem@FreeBSD.org>

Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by: kib


# 16b1f68d 20-Mar-2011 Konstantin Belousov <kib@FreeBSD.org>

Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by: bde
MFC after: 1 week


# 8c2a54de 28-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Add kernel side support for BIO_DELETE/TRIM on UFS.

The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.

Based on the patch by: mckusick
Reviewed by: mckusick, pjd
Tested by: pho
MFC after: 1 month


# fddd463d 01-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Journal start looks up .sujournal file by doing lookup on the root dvp.
As result, failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.

Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.

Initial report by: Garrett Cooper
Reproduced and tested by: pho


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# d0cc54f3 10-Oct-2010 Konstantin Belousov <kib@FreeBSD.org>

The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by: bde
MFC after: 2 weeks


# 59b3a4eb 17-Sep-2010 David E. O'Brien <obrien@FreeBSD.org>

Correct some non-code typos.


# 3634d5b2 20-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 61e1c193 16-Jul-2010 John Baldwin <jhb@FreeBSD.org>

Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with: attilio


# dbfcf8cf 16-Jul-2010 John Baldwin <jhb@FreeBSD.org>

When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after: 3 days


# 710b73c8 23-May-2010 Andriy Gapon <avg@FreeBSD.org>

MFC r208293: ffs_mount: accept and drop userland-only options that can
be passed from loader(8)

PR: kern/141050


# 57da3de3 23-May-2010 Andriy Gapon <avg@FreeBSD.org>

MFC r207366: ffs_vfsops: restore alphabetic order of options in ffs_opts


# 0b962648 19-May-2010 Andriy Gapon <avg@FreeBSD.org>

ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options. Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR: kern/141050
Reported by: many
Reviewed by: jh, bde
MFC after: 4 days


# deb3b115 29-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

ffs_vfsops: restore alphabetic order of options in ffs_opts

The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after: 1 week


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 0718d64d 18-Apr-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r200796:

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


# 8eb59aff 12-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

MFC r206128: ffs_mount: remove redundant assignment of to devvp.v_bufobj


# ecaf3257 03-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj

The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after: 1 week


# 61996181 10-Feb-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused variable.


# 9340fc72 21-Dec-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


# b67ca899 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196920:
insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Approved by: re (kensmith)


# 6cc745d2 07-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by: tegge
Approved by: des (pseudofs part)
MFC after: 3 days


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# dfd233ed 11-May-2009 Attilio Rao <attilio@FreeBSD.org>

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


# c1d8b5e8 16-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks


# e65f5a4e 11-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

The non-modifying EA VOPs are executed with only shared vnode lock taken.
Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.

Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

Noted by: YAMAMOTO, Taku <taku tackymt homeip net>
Tested by: pho (previous version)
MFC after: 2 weeks


# a9d95371 11-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not double-free the struct inode when insmntque failed. Default
insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.

Reviewed by: tegge
MFC after: 3 days


# 33fc3625 11-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


# 5bd65606 09-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after: 1 month


# 4f560d75 23-Feb-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Right now, when trying to unmount a device that's already gone,
msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO.
However, dounmount() treats ENXIO as a success and proceeds with
unmounting. In effect, the filesystem gets unmounted without closing
GEOM provider etc.

Reviewed by: kib
Approved by: rwatson (mentor)
Tested by: dho
Sponsored by: FreeBSD Foundation


# 3c140b2d 23-Feb-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor, moving error checking outside of the
'if (mp->mnt_flag & MNT_SOFTDEP)' conditional. No functional
changes.

Reviewed by: kib
Approved by: rwatson (mentor)
Tested by: pho
Sponsored by: FreeBSD Foundation


# ee445a69 11-Feb-2009 John Baldwin <jhb@FreeBSD.org>

- If the g_access() call for the initial root mount fails, then fully
cleanup. Before the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the
associated devfs vnode is locked.

Reviewed by: kib


# 49c4791c 29-Jan-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make sure the cdev doesn't go away while the filesystem is still mounted.
Otherwise dev2udev() could return garbage.

Reviewed by: kib
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation


# df86ccf6 07-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

If unmount of the ffs mp failed, reinitialize the extended attributes
for the mp, and restart them if autostart is enabled.

Reported and tested by: pho
Reviewed by: rwatson
MFC after: 3 weeks


# 15bc6b2b 28-Oct-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# f8886347 24-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Enable shared lookups on UFS. There are some remaining issues with forced
unmounts, but those are in the VFS lookup code are not UFS specific.

Tested by: pho, kris


# 6fecb4e4 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Suspend the write operations on the UFS filesystem being unmounted or
remounted from rw to ro.

Proposed and reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 2814d5ba 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 52dfc8d7 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month


# 90446e36 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce UFS_RDONLY() struct ufsmount' method that reports the value
of the fs_ronly. The later is set to 1 only after the remount is
finished.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 7b7ed832 28-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by: pho, kris
Reviewed by: tegge
MFC after: 1 month


# e792b09b 09-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with: rodrigc
MFC after: 3 days


# a1a917e0 06-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

User may do "mount -o snapshot ...", that causes new FFS mount to be
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.

Noted by: Mateusz Guzik <mjguzik gmail com>
MFC after: 3 days


# a80d8caa 19-Jul-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Say hi to svn, by simplifing ffs_vget() function a bit - there is no need for
a variable that is used only once.


# fb77e0af 23-May-2008 Craig Rodrigues <rodrigc@FreeBSD.org>

After converting the "snapshot" mount option to the MNT_SNAPSHOT flag,
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.

PR: 122833
Reported by: Leon Kos <leon.kos lecad fs uni-lj si>,
Jaakko Heinonen <jh saunalahti fi>
MFC after: 1 month


# 02a871f1 23-May-2008 Craig Rodrigues <rodrigc@FreeBSD.org>

For the following mount options, do not perform the string to flag conversions
here, because we already do them further up in vfs_donmount() in vfs_mount.c

async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW

MFC after: 1 month


# d952ba1b 26-Mar-2008 John Baldwin <jhb@FreeBSD.org>

Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo'
(such as 'atime' vs 'noatime'). The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'. As such, their list of valid
mount options should include 'nofoo' instead of 'foo'. With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error. You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).

MFC after: 1 week
Reviewed by: crodig


# 698b1a66 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# e7fd8877 05-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs.
It is normally initialized by ffs_statfs() after ffs_mount finished.

The extattr autostart code calls the ufs_lookup(), that uses value above
to iterate over the directory blocks, see bmask initialization in the
ufs_lookup() and ufsdirhash. Having the filesystem with root directory
spanning more then one block would result in reading a random kernel
memory.

PR: kern/120781
Test case provided by: rwatson
MFC after: 1 week


# 6cf7bc60 03-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

Move setting of MNTK_MPSAFE flag before UFS1 extended attribute
auto-start so that the flag is set before we start performing I/O
in the auto-start routine.

MFC after: 2 weeks
Suggested by: kib


# 628f51d2 24-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 0e9eb108 23-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


# d638e093 19-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 77465d93 16-Oct-2007 Alfred Perlstein <alfred@FreeBSD.org>

Get rid of qaddr_t.

Requested by: bde


# 04533fc6 04-Apr-2007 Xin LI <delphij@FreeBSD.org>

Use *_EMPTY macros when appropriate.


# 36d46679 20-Mar-2007 Konstantin Belousov <kib@FreeBSD.org>

Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no more
neccessary Giant acquisions in softdepend processing code.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


# 088ffd20 14-Mar-2007 Konstantin Belousov <kib@FreeBSD.org>

Implement fine-grained locking for UFS quotas.

Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.

i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)

This work were not possible without enormous amount of help given by
Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out
numerous errors and provided invaluable suggestions. Peter did tireless
testing of the patch as it was developed.


# 61b9d89f 12-Mar-2007 Tor Egge <tegge@FreeBSD.org>

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


# 10bcafe9 15-Feb-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs


# 2cc7d26f 23-Jan-2007 Konstantin Belousov <kib@FreeBSD.org>

Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 1a60c7fc 31-Oct-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 8d0547c6 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Protect change to bo_flag by holding the bufobj mutex.


# 5da56ddb 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# 5fe6d2be 09-Jul-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Declare UFS module version.


# 946478fc 09-Jul-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change fs->fs_fsmnt to mp->mnt_stat.f_mntonname in warnings about missing
MAC and ACLs support in the kernel. If it is a first mount, fs->fs_fsmnt
is empty.

MFC after: 1 week


# 71ac2d7c 03-Jun-2006 Craig Rodrigues <rodrigc@FreeBSD.org>

Check the sectorsize of the underlying disk before trying to
bread() the UFS superblock. Should eliminate crashes when trying
to do: mount -t ufs on an audio CD.

PR: kern/85893
Reported by: Russell Francis <rfrancis at ev dot net>
MFC after: 1 week


# ee98eb82 25-May-2006 Craig Rodrigues <rodrigc@FreeBSD.org>

Remove "update" from ffs_opts. It has been moved to global_opts
in vfs_mount.c.


# 5eb304a9 25-May-2006 Craig Rodrigues <rodrigc@FreeBSD.org>

Remove calls to vfs_export() for exporting a filesystem for NFS mounting
from individual filesystems. Call it instead in vfs_mount.c,
after we call VFS_MOUNT() for a specific filesystem.


# 4ba8c2a5 23-May-2006 Craig Rodrigues <rodrigc@FreeBSD.org>

Take errmsg out of ffs_opts. It is already part of global_opts
in vfs_mount.c.


# 868bb88f 02-May-2006 Tor Egge <tegge@FreeBSD.org>

Temporarily undo clusters contribution to global runningbufspace while
handling copy on write for the buffers taking part in the cluster.


# cbd6fedb 27-Apr-2006 Scott Long <scottl@FreeBSD.org>

Fix a typo.


# 6ca9fcc5 27-Apr-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.


# 7b3f1bbd 21-Apr-2006 Tom Rhodes <trhodes@FreeBSD.org>

Revert previous to this file before an actual request is made.


# 8fc22c9d 21-Apr-2006 Tom Rhodes <trhodes@FreeBSD.org>

Remove what I believe are two useless ifdefs. If a user or administrator
enables multilabel, or any option for that matter, most likely they have
a reason. This will allow users to see that mulilabel is enabled via an
issued "mount" command and remove an annoying warning - printed only when
a MAC kernel is not installed - on boot up.

Discussed with: green, brueffer, Samy Al Bahra.
Probably ran past: csjp (though I can't remember).


# 3bbd6d8a 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


# 700118c7 19-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Allow compilation when not using softupdates.


# 7de3839d 19-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Let snapshots make a copy of old contents for all buffers taking part in a
cluster instead of just the first buffer.

Delay buf_start() calls until snapshots have a copy of old content.

PR: kern/93942


# 95e7a3c3 19-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Reduce probability of unmount failing after having unmounted snapshots.


# ca2fa807 10-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Block secondary writes while expunging active unlinked files.

Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.

Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).

Perform deferred inactive handling after the file system is resumed.


# 1e70cd7f 09-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Remove unneeded (and broken) usage of MNT_REF()/MNT_REL().


# 791dd2fa 08-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 82be0a5a 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# b6bd025c 24-Nov-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

Fix parsing of atime, clusterr, clusterw, exec, suid, symfollow
mount options.

Noticed by: Amir Shalem < amir at boom dot org dot il>


# cea90362 20-Nov-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

If export mount flag is not passed in, set default parameters
for export structure and pass that to vfs_export().
Currently in userland mount(8), an export structure is unconditionally
passed in, only for UFS. This is an attempt to move that UFS-specific
behavior out of mount(8) and into the UFS filesystem code.


# 359d4388 19-Nov-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

Add more options to ffs_opts, so that vfs_filteropts() will not
complain when we pass these options to a UFS filesystem as strings
via nmount(): noexec, nosuid, nosymfollow, sync, suiddir


# 26f59b64 17-Nov-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

- Add parsing for the following existing UFS/FFS mount options in the nmount()
callpath via vfs_getopt(), and set the appropriate MNT_* flag:
-> acls, async, force, multilabel, noasync, noatime,
-> noclusterr, noclusterw, snapshot, update

- Allow errmsg as a valid mount option via vfs_getopt(),
so we can later add a hook to propagate mount errors back
to userspace via vfs_mount_error().


# 8680d698 20-Oct-2005 Nate Lawson <njl@FreeBSD.org>

Adjust maxfilesize for UFS1 and old 4.4 FFS. For UFS1, increase the limit
to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from
0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB
to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the
larger files anyway.

Collaboration with: bde


# 9248a827 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Don't pretend that a failed sync write was succesful.


# fdedad76 02-Sep-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

ffs_mountfs() needs devvp to be locked, so lock it.

Glanced at by: phk
Tested by: pjd
MFC after: 3 days


# 93373c42 21-Aug-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Set the mountpoint path in the superblock (fs_fsmnt) at mount-time
so that it appears in the various messages (not cleanly unmounted,
filesystem full, etc). This has been broken since rev 1.261.


# ec9c9e73 20-Jul-2005 Alan Cox <alc@FreeBSD.org>

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


# 204ec66d 30-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't set our bio op to be a READ when we've just completed a write. There
are subtle differences in the read and write completion path. Instead,
grab an extra write ref so the write path can drop it when we recursively
call bufdone(). I believe this may be the source of the wrong bufobj
panics.

Reported by: pho, kkenn


# 41d4783d 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- In ffs_sync we need to pass LK_SLEEPFAIL in when we lock the vnode
because it may change identities while we're sleeping on the lock.
Otherwise we may bail out of ffs_sync() early due to an error from
deadfs.
- Collapse a VOP_UNLOCK, vrele into a single vput().


# 153910e0 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix
two bugs.
- ffs_disk_prewrite was pulling the vp from the buf and checking for
COPYONWRITE, when really it wanted the vp from the bufobj that we're
writing to, which is the devvp. This lead to us skipping the copy on
write to all file data, which significantly broke snapshots for the
last few months.
- When the SOFTUPDATES option was not included in the kernel config we
would also skip the copy on write check, which would effectively disable
snapshots.
- Remove an invalid mp_fixme().

Debugging tips from: mckusick
Reported by: iedowse, others
Discussed with: phk


# aa7ba427 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- FFS supports shared locks, clear LK_NOSHARE from our vnode locks.

Sponsored by: Isilon Systems, Inc.


# d6919865 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Upgrade a shared lock request to exclusive in ffs_vget() if we have
to create the vnode.

Sponsored by: Isilon Systems, Inc.


# 51f5ce0c 16-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add two arguments to the vfs_hash() KPI so that filesystems which do
not have unique hashes (NFS) can also use it.


# de68347b 15-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't hold a reference on the disk vnode for each inode.


# 45c26fa2 15-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Improve the vfs_hash() API: vput() the unneeded vnode centrally to
avoid replicating the vput in all the filesystems.


# e82ef95c 15-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify the vfs_hash calling convention.


# 14bc0685 14-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Use vfs_hash instead of home-rolled.


# fe68abe2 12-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.
- Shorten ffs_reload by one step. The old check for an inactive vnode
was slightly racey, and the code which deals with still active vnodes
is not much more expensive.

Sponsored by: Isilon Systems, Inc.


# dfd4be14 19-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


# adf41577 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make a some SYSCTL_NODEs and some of FFS's VFS_ methods static.


# 02f2c6a9 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Split the vop_vector for ffs1 and ffs2, this is mostly for the different
EXTATTR support.


# dd19a799 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Background writes are entirely an FFS/Softupdates thing.

Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.


# efd6d980 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use the UFS_* and VFS_* functions where a direct call is possble.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 40854ff5 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 84a69752 25-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce and use g_vfs_close().


# ce12d37e 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't create vnode_pager objects for the disk device.
geom_vfs will do that.


# 3ba649d7 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Initialize and destroy the per-filesystem ufs lock where appropriate.
- Use the buffer lock on the superblock buf to serialize calls to
sbupdate.
- Set the MNTK_MPSAFE flag when QUOTA is not defined in the kernel.

Sponsored By: Isilon Systems, Inc.


# 39cfb239 15-Jan-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix ACLs handling for the root file system.
Without this fix, when ACLs are set via tunefs(8) on the root file system,
they are removed on boot when 'mount -a' is called, because mount(8)
called for the root file system always add MNT_UPDATE flag and MNT_UPDATE
flag isn't perfect.
Now, one cannot remove ACLs stored in superblock (configured with tunefs(8))
via 'mount -a' nor 'mount -u -o noacls <file system>', but it is still
possible to mount file system which doesn't have ACLs in superblock via
'mount -o acls <file system>' or /etc/fstab's 'acls' option.

Reported by: Lech Lorens/pl.comp.os.bsd
Discussed with: phk, rwatson
Reviewed by: rwatson
MFC after: 2 weeks


# 7c0745ee 14-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# e39db32a 12-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 6ef8480a 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 4a18054d 12-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

With the introduction of UFS2 we started looking for superblocks in
four different locations on a prospective filesystem.

If we found none, we forgot to invalidate the four buffers, thus the
following sequence would fails:

(md0 = blank disk)
mount /dev/md0 /mnt
(fails, no superblocks)
newfs /dev/md0
(writes using physio which does not go through buffercache).
mount /dev/md0 /mnt
(still fails, the four cached buffers still contain no superblocks)

Found by: ru


# f21cc2ca 07-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Fix nfs exports (for now). The real fix is to teach mountd about
nmount.


# 20a92a18 07-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


# 74331236 05-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but a few cases
doesn't. Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.

Fix this the cleaner way: Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.


# 93e0b506 03-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

typo in comment.


# aec0fb7b 01-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# 964ebefd 25-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Use system wide no-op vfs_start function.


# 51ac12ab 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Be prepared to accept NULL mountargs as part of root-mounting.


# cf5e4149 12-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put back the vfs_object_create() calls, they do make a difference when
my test-setup does what I want it to instead of what I ask it to.

Pointed out by: tegge


# 40ce27cb 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

fix some comments


# 2e664919 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Use mount flags instead of NULL path to detect root filesystem mount.


# 5e2ccaff 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Stop pretending to have a vm_object backing the underlying disk vnode:
it isn't used for anything anywhere and the vnode_pager would explode
if we attempted to.


# 40c340aa 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Don't grab the exclusive bit on a root filesystem until we are willing
to mount it. Doing so prevented fsck to be run after a refused mount.


# 43920011 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


# 570a7dda 28-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

We only support backing UFS/FFS with disks.


# 8dd56505 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

White space changes. Add missing static.


# 6e77a041 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The island council met and voted buf_prewrite() home.

Give ffs it's own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 156cb265 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# ee1d0eb3 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_bsize. This was a dead-end.


# 8d02a378 05-Oct-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Back out changes which were introduced to delay mounting root file system.
Those changes were made on gmirror needs, but now gmirror handles this
by itself.


# 4f116178 28-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove support for accessing device nodes in UFS/FFS.

Device nodes can still be created and exported with NFS.


# 5a19f8b0 23-Sep-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Introduce new /boot/loader.conf variable: root_mount_delay.
It can be used to delay mounting root partition to give a chance to GEOM
providers to show up.
Now, when there is no needed provider, vfs_rootmount() function will look
for it every second and if it can't be find in defined time, it'll ask
for root device name (before this change it was done immediately).

This will allow to boot from gmirror device in degraded mode.


# 5e8c582a 30-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put a version element in the VFS filesystem configuration structure
and refuse initializing filesystems with a wrong version. This will
aid maintenance activites on the 5-stable branch.

s/vfs_mount/vfs_omount/

s/vfs_nmount/vfs_mount/

Name our filesystems mount function consistently.

Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.

Reorganize the root filesystem selection code.


# d634f693 28-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove global variable rootdevs and rootvp, they are unused as such.

Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.


# b403319b 28-Jul-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# d8d3d415 14-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make sure to update the mnt_stats before UFS1 extattr tried to
do I/O on the device. Otherwise the blocksize is undefined in the
buffer cache.


# f257b7a5 12-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# c94cd5fc 07-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Explicity initialize vp->v_bsize.


# e3c5a7a4 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 451079d4 29-Apr-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Revert previous change to this file because it breaks some
things which compare /etc/fstab entries to results from
getfsstat(). The real way to fix this is to make 'ufs2'
a recognized filesystem (for real, no beating around the
bush).

This should fix things like 'umount -a -t ufs' now.
Appologies for the previous breakage.


# 2aebb586 26-Apr-2004 Bosko Milekic <bmilekic@FreeBSD.org>

The previous change to mount(8) to report ufs or ufs2 used
libufs, which only works for Charlie root.

This change reverts the introduction of libufs and moves the
check into the kernel. Since the f_fstypename is the same
for both ufs and ufs2, we check fs_magic for presence of
ufs2 and copy "ufs2" explicitly instead.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>


# 012d4134 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


# e9827c6d 13-Feb-2004 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs:
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.


# 0efb1394 12-Feb-2004 Bruce Evans <bde@FreeBSD.org>

MFextfs: backed out secondary changes in rev.1.40 that had become just
style bugs (a variable that is used only once, and misformattings).


# 8adff5fc 12-Feb-2004 Bruce Evans <bde@FreeBSD.org>

Fixed some minor style bugs (English usage and formatting of binary
operators) in and near revs.1.169-1.170 (open mode bandaid). This
(or better a proper fix) should have been done before cloning the
bandaid to many other file systems.


# 31c81e4b 06-Dec-2003 Don Lewis <truckman@FreeBSD.org>

Set fs_ronly to the correct value in ffs_reload() when reloading the file
system super block after fsck has repaired the file system. The value of
fs_ronly was getting overwritten, which caused ffs_update() to attempt to
update inode timestamps even though the file system was still mounted
read-only.

This fixes the "giving up on N buffers" error that is triggered by running
fsck on the root file system and then rebooting without mounting the file
system read-write.


# fde81c7d 12-Nov-2003 Kirk McKusick <mckusick@FreeBSD.org>

Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.


# ca430f2e 04-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 45d45c6c 02-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Use VOP_UNLOCK/vrele instead of vput. td was erecived as a parameter
and one cannot be sure it is equal to curthread.


# cb9ddc80 01-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Take care not to call vput if thread used in corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.

This fixes the panic on shutdown people have reported.


# 492c1e68 31-Oct-2003 Alexander Kabaev <kan@FreeBSD.org>

Temporarily undo parts of the stuct mount locking commit by jeff.
It is unsafe to hold a mutex across vput/vrele calls.

This will be redone when a better locking strategy is agreed upon.

Discussed with: jeff


# 69b609a8 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- The VCHR case in ffs_sync() is an unneccsary optimization especially
considering how infrequently we access devices via ffs now that we have
devfs. Collapse this case with the other case.

Obtained from: bde


# ab1f917b 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Further simplify ffs_sync(). The vnode lock is required for UFS_UPDATE()
so make the code slightly more uniform. The vnode lock is acquired in
all cases and now the only difference between VCHR and other is we
call UFS_UPDATE instead of VOP_FSYNC().


# 2f05568a 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Check the XLOCK before inspecting v_data.
- Slightly rewrite the fsync loop to be more lock friendly. We must
acquire the vnode interlock before dropping the mnt lock. We must
also check XLOCK to prevent vclean() races.
- Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean()
races.
- Use a local variable to store the results of the nvp == TAILQ_NEXT
test so that we do not access the vp after we've vrele()d it.
- Add an XXX comment about UFS_UPDATE() not being protected by any lock
here. I suspect that it should need the VOP lock.


# 04a17687 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Increase the scope of the interlock in ffs_reload(). Acquire it before
we release the mntvnode_mtx.
- Call vgonel() directly instead of going through vrecycle() since we own
the interlock now.
- Remove a few cases where we locked the interlock just so that we could
call VOP_UNLOCK with interlock held.


# 5c24d6ee 15-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the i_devvp field from the incore UFS inodes, we can
get the same value from ip->i_ump->um_devvp.

This saves a pointer in the memory copies of inodes, which can
easily run into several hundred kilobytes.

The extra indirection is unmeasurable in benchmarks.

Approved by: mckusick


# a8d43c90 26-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*

For now pass -1 all over the place.


# 7652131b 12-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize struct vfsops C99-sparsely.

Submitted by: hmp
Reviewed by: phk


# f4636c59 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 6280ed26 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused local variables.

Found by: FlexeLint


# 36329289 01-May-2003 Tim J. Robbins <tjr@FreeBSD.org>

Do not attempt to free NULL dinodes (i_din1 or i_din2) in ffs_ifree().
These fields can be left as NULL if ffs_vget() allocates an inode but
fails before the dinode memory has been allocated. There are two cases
when this can occur: when we lose a race and another process has added
the inode to the hash, and when reading the inode off disk fails.

The bug was observed by Kris on one of the package-building machines.
See http://marc.theaimsgroup.com/?l=freebsd-current&m=105172731013411&w=2
In Kris's case, it was the bread() that failed because of a disk error.

The alternative to this patch is to ensure that ffs_vget() does not call
vput() when the inode that hasn't been properly initialised.


# 8d721e87 01-May-2003 Tim J. Robbins <tjr@FreeBSD.org>

Free i_din2 instead of i_din1 in ffs_ifree() on UFS2 filesystems.
This is purely a cosmetic change because these members are in a
union together.


# 31566c96 20-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Use td->td_ucred instead of td->td_proc->p_ucred.


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 7261f5f6 03-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# 74f3809a 25-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

Change the field used to test whether the superblock has been updated
from the filesystem size field to the filesystem maximum blocksize
field. The problem is that older versions of growfs updated only the
new size field and not the old size field. This resulted in the old
(smaller) size field being copied up to the new size field which
caused the filesystem to appear to fsck to be badly trashed.

This also adds a sanity check to ensure that the superblock is not
being updated when the filesystem is mounted read-only. Obviously
such an update should never happen.

Reported by: Nate Lawson <nate@root.org>
Sponsored by: DARPA & NAI Labs.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# aca3e497 14-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

Replace use of random() with arc4random() to provide less guessable
values for the initial inode generation numbers in newfs and for
newly allocated inode generation numbers in the kernel.

Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org>
Sponsored by: DARPA & NAI Labs.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# aa4d7a8a 27-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Use three UMA zones for FFS/UFS inodes instead of malloc space.
Since inodes are currently 144 bytes, this will save 112 bytes per
inode. This can amount to up to 10MByte on large systems.


# de6ba7c0 27-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move the allocation of the inode contents into ffs_vfsops.c rather than
passing malloc types around.


# 975512a9 27-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make ffs_mountfs() static.

Remove the malloctype from the ufs mount structure, instead add a callback
to the storage method for freeing inodes: UFS_IFREE().

Add vfs_ifree() method function which frees an inode.

Unvariablelize the malloc type used for allocating inodes.


# 31574422 30-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add a check to disable the previous patch so that future filesystems
that choose to place their superblocks in non-standard locations will
not get them smashed.

Sponsored by: DARPA & NAI Labs.


# fa5d33e2 29-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Check to make sure that the fs_sblockloc field was properly updated
before using it to write the superblock. This is to guard against
accidentally trashing the disklabel if the superblock format missed
being upgraded by the new kernel.

Reported by: Sam Leffler <sam@errno.com>
Sponsored by: DARPA & NAI Labs.
Approved by: Murray Stokely <murray@FreeBSD.org>


# ada981b2 26-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Create a new 32-bit fs_flags word in the superblock. Add code to move
the old 8-bit fs_old_flags to the new location the first time that the
filesystem is mounted by a new kernel. One of the unused flags in
fs_old_flags is used to indicate that the flags have been moved.
Leave the fs_old_flags word intact so that it will work properly if
used on an old kernel.

Change the fs_sblockloc superblock location field to be in units
of bytes instead of in units of filesystem fragments. The old units
did not work properly when the fragment size exceeeded the superblock
size (8192). Update old fs_sblockloc values at the same time that
the flags are moved.

Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk>
Sponsored by: DARPA & NAI Labs.


# 763bbd2f 26-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 9ab73fd1 24-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 80830407 15-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock,
automatically set MNT_MULTILABEL in the mount flags.

If FS_ACLS is set in a UFS or UFS2 superblock, automatically
set MNT_ACLS in the mount flags.

If either of these flags is set, but the appropriate kernel option
to support the features associated with the flag isn't available,
then print a warning at mount-time.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# a5b65058 13-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.


# b6cef564 08-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

The appropriate units for disk block addresses are always DEV_BSIZE,
even when the underlying device has a larger sector size. Therefore,
the filesystem code should not (and with this patch does not) try to
use the underlying sector size when doing disk block address calculations.

This patch fixes problems in -current when using the swap-based
memory-disk device (mdconfig -a -t swap ...). This bugfix is not
relevant to -stable as -stable does not have the memory-disk device.

Sponsored by: DARPA & NAI Labs.


# 2ee5711e 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


# 06be2aaa 14-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# d6fe88e4 13-Aug-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an
UFS only thing, and FFS should in principle not know if it is enabled
or not.

This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c

Sponsored by: DARPA and NAI Labs.


# 9bf1a756 13-Aug-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce typedefs for the member functions of struct vfsops and employ
these in the main filesystems. This does not change the resulting code
but makes the source a little bit more grepable.

Sponsored by: DARPA and NAI Labs.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 5346934f 01-Jul-2002 Ian Dowse <iedowse@FreeBSD.org>

Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by: mckusick


# 8f42fb8f 26-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Remove the kernel file-size limit for UFS2, so that only the limit
imposed by the filesystem structure itself remains. With 16k blocks,
the maximum file size is now just over 128TB.

For now, the UFS1 file size limit is left unchanged so as to remain
consistent with RELENG_4, but it too could be removed in the future.

Reviewed by: mckusick


# cfbf0a46 23-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by: mckusick


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 13866b3f 06-Jun-2002 Semen Ustimenko <semenu@FreeBSD.org>

Fix a typo in my recently added comment: s/beleived/believed/

Submitted by: keramida


# f576a00d 30-May-2002 Semen Ustimenko <semenu@FreeBSD.org>

Remove lock from ffs_vget introduced by v1.24. Instead of locking the
vnode creation globaly, we allow processes to create vnodes concurently.
In case of concurent creation of vnode for the one ino, we allow processes
to race and then check who wins.

Assuming that concurent creation of vnode for same ino is really rare case,
this is belived to be an improvement, as it just allows concurent creation
of vnodes.

Idea by: bp
Reviewed by: dillon
MFC after: 1 month


# 00b162d0 18-May-2002 Ian Dowse <iedowse@FreeBSD.org>

Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u
unions, since these were only necessary when ext2fs used ufs code.

Reviewed by: mckusick


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 05f4ff5d 13-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove register keyword.

Sponsored by: DARPA & NAI Labs.
Submitted by: mckusick


# 2dd527b3 08-Apr-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>.

Sponsored by: DARPA & NAI Labs


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 46a67eac 02-Apr-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl.


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 0508986c 30-Mar-2002 Bruce Evans <bde@FreeBSD.org>

In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally
provided the latter is nonzero. At this point, the former is a fairly
arbitrary default value (DFTPHYS), so changing it to any reasonable
value specified by the device driver is safe. Using the maximum of
these limits broke ffs clustered i/o for devices whose si_iosize_max
is < DFLTPHYS. Using the minimum would break device drivers' ability
to increase the active limit from DFTLPHYS up to MAXPHYS.

Copied the code for this and the associated (unnecessary?) fixup of
mp_iosize_max to all other filesystems that use clustering (ext2fs and
msdosfs). It was completely missing.

PR: 36309
MFC-after: 1 week


# 6f1e8551 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a0595d02 16-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


# 063f7763 11-Mar-2002 Poul-Henning Kamp <phk@FreeBSD.org>

I missed one VOP_CLOSE in the previous commit.

Pointed out by: bde


# 3dbceccb 11-Mar-2002 Poul-Henning Kamp <phk@FreeBSD.org>

As a XXX bandaid open the mounted device READ/WRITE even if we only mount
read-only.

The trouble here is that we don't reopen the device in read/write mode
when we remount in read/write mode resulting in a filesystem sending
write requests to a device which was only opened read/only.

I'm not quite sure how such a reopen would best be done and defer
the problem to more agile hackers.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# cd600596 15-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>


# 23b59018 20-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# 143a5346 16-Dec-2001 Ian Dowse <iedowse@FreeBSD.org>

Make sure we ignore the value of `fs_active' when reloading the
superblock, and move the initialisation of it to beside where other
pointer fields are initialised.


# cc5a9233 13-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.


# 245df27c 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# c72ccd01 22-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


# b73d2870 02-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

o Replace two direct uid!=0 comparisons with suser_td() calls.

Obtained from: TrustedBSD Project


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 4691e9ea 09-Sep-2001 Ian Dowse <iedowse@FreeBSD.org>

The "dirpref" directory layout preference improvements make use of
an array "fs_contigdirs[]" to avoid too many directories getting
created in each cylinder group. The memory required for this and
two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with
a single malloc() call, and divided up afterwards. However, the
'space' pointer is not advanced correctly, so fs_contigdirs and
fs_maxcluster end up pointing to the same address.

Add the missing code to advance the 'space' pointer, and remove
an unnecessary update of the pointer that follows.

This is likely to fix the "ffs_clusteralloc: map mismatch" panics
that have been reported recently.

Submitted by: Luke Mewburn <lukem@wasabisystems.com>


# 7df97b61 01-Sep-2001 Robert Watson <rwatson@FreeBSD.org>

o At some point, unmounting a non-EA file system with EA's compiled
in got a bit broken, when ufs_extattr_stop() was called and failed,
ufs_extattr_destroy() would panic. This makes the call to destroy()
conditional on the success of stop().

Submitted by: Christian Carstensen <cc@devcon.net>
Obtained from: TrustedBSD Project


# ed87274d 28-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Fix more mntvnode and vnode interlock order reversals.


# 49d2d9f4 27-Jun-2001 John Baldwin <jhb@FreeBSD.org>

- Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.


# c7a3e237 29-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove last vestiges of MFS.


# 0864ef1e 16-May-2001 Ian Dowse <iedowse@FreeBSD.org>

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


# 9ccb939e 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.


# 855aa097 28-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 112f7372 25-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.


# a13234bb 25-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>


# 5d69bac4 23-Apr-2001 Ian Dowse <iedowse@FreeBSD.org>

Pre-dirpref versions of fsck may zero out the new superblock fields
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.

Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.

Reviewed by: mckusick


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 5819ab3f 16-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

ffs_clusteralloc: map mismatch
ffs_nodealloccg: map corrupted
ffs_nodealloccg: block not in map
ffs_alloccg: map corrupted
ffs_alloccg: block not in map
ffs_alloccgblk: cyl groups corrupted
ffs_alloccgblk: can't find blk in cyl
ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

ffs_valloc: dup alloc
ffs_blkfree: freeing free block
ffs_blkfree: freeing free frag
ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.


# 1a6a6610 13-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */


# a61ab64a 10-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
test is at wd1. Size of test file system is 8 Gb, number of cg=991,
size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

Test Results

tar -xzf ports.tar.gz rm -rf ports
mode old dirpref new dirpref speedup old dirprefnew dirpref speedup
First system
normal 667 472 1.41 477 331 1.44
async 285 144 1.98 130 14 9.29
sync 768 616 1.25 477 334 1.43
softdep 413 252 1.64 241 38 6.34
Second system
normal 329 81 4.06 263.5 93.5 2.81
async 302 25.7 11.75 112 2.26 49.56
sync 281 57.0 4.93 263 90.5 2.9
softdep 341 40.6 8.4 284 4.76 59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
* Find a cylinder to place a directory.
*
* The policy implemented by this algorithm is to select from
* among those cylinder groups with above the average number of
* free inodes, the one with the smallest number of directories.
*/

A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.

What I mean by a big file system ?

1. A big filesystem is a filesystem which occupy 20-30 or more percent
of total drive space, i.e. first and last cylinder are physically
located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example
more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
* Find a cylinder group to place a directory.
*
* The policy implemented by this algorithm is to allocate a
* directory inode in the same cylinder group as its parent
* directory, but also to reserve space for its files inodes
* and data. Restrict the number of directories which may be
* allocated one after another in the same cylinder group
* without intervening allocation of files.
*
* If we allocate a first level directory then force allocation
* in another cylinder group.
*/

My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.

The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */

These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.

Obtained from: Grigoriy Orlov <gluk@ptci.ru>


# 812b1d41 20-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Add kernel support for running fsck on active filesystems.


# 516081f2 18-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change
reflects the fact that our EA support is implemented entirely at the
UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
events). This also better reflects the fact that [shortly] MFS will also
support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
must change kernel config files to reflect the new option names.

Obtained from: TrustedBSD Project


# f5161237 13-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Implement "options FFS_EXTATTR_AUTOSTART", which depends on
"options FFS_EXTATTR". When extended attribute auto-starting
is enabled, FFS will scan the .attribute directory off of the
root of each file system, as it is mounted. If .attribute
exists, EA support will be started for the file system. If
there are files in the directory, FFS will attempt to start
them as attribute backing files for attributes baring the same
name. All attributes are started before access to the file
system is permitted, so this permits race-free enabling of
attributes. For attributes backing support for security
features, such as ACLs, MAC, Capabilities, this is vital, as
it prevents the file system attributes from getting out of
sync as a result of file system operations between mount-time
and the enabling of the extended attribute. The userland
extattrctl tool will still function exactly as previously.
Files must be placed directly in .attribute, which must be
directly off of the file system root: symbolic links are
not permitted. FFS_EXTATTR will continue to be able
to function without FFS_EXTATTR_AUTOSTART for sites that do not
want/require auto-starting. If you're using the UFS_ACL code
available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
is recommended.

o This support is implemented by adding an invocation of
ufs_extattr_autostart() to ffs_mountfs(). In addition,
several new supporting calls are introduced in
ufs_extattr.c:

ufs_extattr_autostart(): start EAs on the specified mount
ufs_extattr_lookup(): given a directory and filename,
return the vnode for the file.
ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
after doing the equililent of vn_open()
on the passed file.
ufs_extattr_iterate_directory(): iterate over a directory,
invoking ufs_extattr_lookup() and
ufs_extattr_enable_with_open() on each
entry.

o This feature is not widely tested, and therefore may contain
bugs, caution is advised. Several changes are in the pipeline
for this feature, including breaking out of EA namespaces into
subdirectories of .attribute (this is waiting on the updated
EA API), as well as a per-filesystem flag indicating whether
or not EAs should be auto-started. This is required because
administrators may not want .attribute auto-started on all
file systems, especially if non-administrators have write access
to the root of a file system.

Obtained from: TrustedBSD Project


# 589c7af9 07-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).


# f3a90da9 01-Mar-2001 Adrian Chadd <adrian@FreeBSD.org>

Reviewed by: jlemon

An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
kernel variables in vfs_mount(). `data' remains a pointer into
userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
aren't too long.

* rework mount() and linux_mount() to take the userland parameters
(besides data, as mentioned) and pass kernel variables to vfs_mount().
(linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
filesystem. This variable is generally initialised with `path', and
each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 1b367556 23-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Convert all simplelocks to mutexes and remove the simplelock implementations.


# f55ff3f3 15-Jan-2001 Ian Dowse <iedowse@FreeBSD.org>

The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by: mckusick


# 937c4dfa 13-Dec-2000 Seigo Tanimura <tanimura@FreeBSD.org>

Do not race for the lock of an inode hash.

Reviewed by: jhb


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 0b0c10b4 13-Oct-2000 Adrian Chadd <adrian@FreeBSD.org>

Initial commit of IFS - a inode-namespaced FFS. Here is a short
description:

How it works:
--

Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.

File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:

fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);

printf("new file is %d\n", (int)st.st_ino);

Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:

fd = open("5", O_RDWR);

The creation permissions depend entirely if you have write access to the
root directory of the filesystem.

To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.

--

What this entails:

* patching conf/files and conf/options to include IFS as a new compile
option (and since ifs depends upon FFS, include the FFS routines)

* An entry in i386/conf/NOTES indicating IFS exists and where to go for
an explanation

* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
routines require (ffs_mount() and ffs_reload())

* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
routines. IFS replaces some of the vfsops, and a handful of vnops -
most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
Any other directory operation is marked as invalid.

What this results in:

* an IFS partition's create permissions are controlled by the perm/ownership of
the root mount point, just like a normal directory

* Each inode has perm and ownership too

* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
completely seperate filesystem here

* Softupdates doesn't work with IFS, and really I don't think it needs it.
Besides, fsck's are FAST. (Try it :-)

* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
this particular inode, and unravelling THAT code isn't trivial. Therefore,
useful inodes start at 3.

Enjoy, and feedback is definitely appreciated!


# 7eb9fca5 09-Oct-2000 Eivind Eklund <eivind@FreeBSD.org>

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# ff435dcb 06-Oct-2000 Robert Watson <rwatson@FreeBSD.org>

o Move initialization of ump from mp to the top of the function so that
it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic
due to a NULL pointer dereference.

Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>


# 9de54ba5 03-Oct-2000 Robert Watson <rwatson@FreeBSD.org>

o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean
up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended
attributes will eventually occur, as it was in my commit tree also.
No implementation change here, only a comment.


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 67e87166 25-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


# 8694d8e9 01-Aug-2000 Ollivier Robert <roberto@FreeBSD.org>

Fix the lockmgr panic everyone is seeing at shutdown time.
vput assumes curproc is the lock holder, but it's not true in this case.

Thanks a lot Luoqi !

Submitted by: luoqi
Tested by: phk


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# 3fbd9742 12-Jul-2000 Boris Popov <bp@FreeBSD.org>

Prevent possible dereference of NULL pointer.

Submitted by: Marius Bendiksen <mbendiks@eunet.no>


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# b2b0497a 03-Jun-2000 Robert Watson <rwatson@FreeBSD.org>

o If FFS_EXTATTR is defined, don't print out an error message on unmount
if an FFS partition returns EOPNOTSUPP, as it just means extended
attributes weren't enabled on that partition. Prevents spurious
warning per-partition at shutdown.


# f3706a03 07-May-2000 Robert Watson <rwatson@FreeBSD.org>

s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf.

Reported by: knu


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 2c9b67a8 30-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


# a64ed089 14-Apr-2000 Robert Watson <rwatson@FreeBSD.org>

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# cf60e8e4 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 7e58bfac 23-Dec-1999 Bruce Evans <bde@FreeBSD.org>

Update the unclean flag for mount -u. I forgot to handle this case
when I made the absence of the clean flag sticky in rev.1.88. This
was a problem main for "mount /". There is no way to mount "/" for
writing without using mount -u (normally implicitly), so after
"mount -f /" of an unclean filesystem, the absence of the clean flag
was sticky forever.


# 91f37dcb 18-Dec-1999 Robert Watson <rwatson@FreeBSD.org>

Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by: eivind


# 762e6b85 15-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Introduce NDFREE (and remove VOP_ABORTOP)


# 38224dcd 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 698f9cf8 09-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Next step in the device cleanup process.

Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.


# 5bd5c8b9 02-Nov-1999 Bruce Evans <bde@FreeBSD.org>

Quick fix for breakage of ext2fs link counts as reported by stat(2) by
the soft updates changes: only report the link count to be i_effnlink
in ufs_getattr() for file systems that maintain i_effnlink.

Tested by: Mike Dracopoulos <mdraco@math.uoa.gr>


# 6d147828 01-Nov-1999 Mike Smith <msmith@FreeBSD.org>

Newline-terminate the complaint message about not being able to find
the root vnode pointer.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# b89392e7 30-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the D_NOCLUSTER[RW] options which were added because vn had
problems. Now that Matt has fixed vn, this can go. The vn driver
should have used d_maxio (now si_iosize_max) anyway.


# 1b5464ef 29-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


# c24fda81 10-Sep-1999 Alfred Perlstein <alfred@FreeBSD.org>

Seperate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|stafs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from: NetBSD


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 41d2e3e0 24-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


# 7dc5cd04 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

The bdevsw() and cdevsw() are now identical, so kill the former.


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 68de329e 11-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Use the fsid from the superblock, unless it looks bogus or has already
been taken by some other filesystem.


# 2447bec8 31-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


# 4be2eb8c 08-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


# 46eede00 07-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Continue where Julian left off in July 1998:

Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.

Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)

Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# de5d1ba5 07-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# fb116777 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the 'waslocked' parameter to vfs_object_create().


# a777e820 01-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the last clients of vfs_object_create(..., waslocked=1);
waslocked will go away shortly.

Reviewed by: dg


# 40c8cfe5 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


# b5ee1640 27-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Oops, the redundant tests for major numbers weren't redundant here.
They checked for the magic major number for the "device" behind mfs
mount points. Use a more obvious check for this device.

Debugged by: Andrew Gallatin <gallatin@cs.duke.edu>


# 9c0619da 25-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted
when bdevsw[] became sparse. We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.

Removed redundant `major(dev) < nblkdev' tests instead of updating them.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 0922cce6 25-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Fixed clean flag handling:
- don't set the clean flag on unmount of an unclean filesystem that was
(forcibly) mounted rw.
- set the clean flag on rw -> ro update of a mounted initially-clean
filesystem.
- fixed some style bugs (mostly long lines).

This uses the fs_flags field and FS_UNCLEAN state bit which were
introduced in the softdep changes. NetBSD uses extra state bits in
fs_clean.

Reviewed by: luoqui


# d024c955 14-Sep-1998 Søren Schmidt <sos@FreeBSD.org>

Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.


# 8994ca3c 07-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Removed statically configured mount type numbers (MOUNT_*) and all
references to them.

The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.


# 0492d857 17-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# bcbd6c6f 08-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Don't update superblock if mounted readonly,
also fixes some problems with softupdates on root.
More cleanups are needed here..
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


# 8435e0ae 04-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

Use size_t instead of u_int for sizes.


# c11d2981 18-May-1998 Julian Elischer <julian@FreeBSD.org>

try stop the user from using mount -u to set the async flag on
a filesystem currently using soft updates.
Also needs a new copy of ffs_softdep.c to complete the fix.


# 79cc756d 05-May-1998 Mike Smith <msmith@FreeBSD.org>

As described by the submitter:

Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.

The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.

Submitted by: Michael Hancock <michaelh@cet.co.jp>


# c0bab11d 19-Apr-1998 Julian Elischer <julian@FreeBSD.org>

Make the devfs SLICE option a standard type option.
(hopefully it will go away eventually anyhow)


# 3e425b96 19-Apr-1998 Julian Elischer <julian@FreeBSD.org>

Add changes and code to implement a functional DEVFS.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.

/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.

Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others

This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.

When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 26cf9c3b 27-Mar-1998 Peter Wemm <peter@FreeBSD.org>

Enable the use of soft updates on the root filesystem. Previously, the
softdep mode could only be activated on the initial mount of a filesystem
and then only if it was a read-write mount. A 'mount -r' (as done in the
rootfs mount) followed by a 'mount -u' to convert to read-write didn't
start softdep mode.


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 16337c2e 07-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Fixed missing simple_lock() in ffs_mountfs().


# 34bdbbd0 01-Mar-1998 Mike Smith <msmith@FreeBSD.org>

The intent is to get rid of WILLRELE in vnode_if.src by making
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:

vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink

This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.

A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.

VFS_VRELE implementations were made for the following file systems:

Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs

Custom
union umapfs

Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660

These implementations may change as VOP changes are implemented.

In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.

This will only be done for vnode arguments that are released by the various
fs vop implementations.

Wider use of VFS_VRELE will likely require restructuring of the code.

Reviewed by: phk, dyson, terry et. al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>


# c9b99213 24-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Fixed missing permissions checking for mounting by non-root.

There is now less need for the vfs.usermount sysctl. msdosfs already
has this change, modulo a missing LK_RETRY, via NetBSD. At least
ext2fs is missing this and many other changes from Lite2.

Obtained from: Lite2


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 9cfcd011 01-Feb-1998 John Dyson <dyson@FreeBSD.org>

Back out recent laptop sync changes. They had significant errors.


# de1050d8 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Support more intelligent sync operations for MNT_NOATIME.
PR: kern/5577
Submitted by: Craig Leres <leres@ee.lbl.gov>


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# b1f4a44b 11-Nov-1997 Julian Elischer <julian@FreeBSD.org>

Reviewed by: various.

Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.


# 987f5696 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


# cec0f20c 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 0be6b890 10-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Add type arg to ffs_mountfs and avoid examining v_tag to find out
if MFS is getting a free ride.

Use generic ufs_reclaim().


# 81bca6dd 27-Sep-1997 KATO Takenori <kato@FreeBSD.org>

Clustered read and write are switched at mount-option level.

1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW
bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread
and vfs.foo.doclusterwrite are deleted. Only mount option can
control clustered I/O from userland.
2. When foofs_mount mounts block device, foofs_mount checks D_CLUSTERR
and D_CLUSTERW bits of the d_flags member in the block device switch
table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR /
MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and
MNT_NOCLUSTERW cannot be cleared from userland.
3. Vnode driver disables both clustered read and write.
4. Union filesystem disables clutered write.

Reviewed by: bde


# 41fadeeb 07-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed yet more vestiges of config-time swap configuration and/or
cleaned up nearby cruft.


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# fce002fd 24-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


# 8f89943e 23-Mar-1997 Guido van Rooij <guido@FreeBSD.org>

Add generation number randomization. Newly created filesystems wil now
automatically have random generation numbers. The kenel way of handling those
also changed. Further it is advised to run fsirand on all your nfs exported
filesystems. the code is mostly copied from OpenBSD, with the randomization
chanegd to use /dev/urandom
Reviewed by: Garrett
Obtained from: OpenBSD


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 3c816944 21-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Fixed some invalid (non-atomic) accesses to `time', mostly ones of the
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.


# cc9d8990 18-Mar-1997 Peter Wemm <peter@FreeBSD.org>

Restore the lost MNT_LOCAL flag twiddle. Lite2 has a different mechanism
of setting it (compiled into vfs_conf.c), but we have a dynamic system
in place. This could probably be better done via a runtime configure
flag in the VFS_SET() VFS declaration, perhaps VFCF_LOCAL, and have the
VFS code propagate this down into MNT_LOCAL at mount time. The other FS's
would need to be updated, havinf UFS and MSDOSFS filesystems without
MNT_LOCAL breaks a few things.. the man page rebuild scans for local
filesystems and currently fails, I suspect that other tools like find
and tar with their "local filesystem only" modes might be affected.


# 6ae83587 15-Mar-1997 Søren Schmidt <sos@FreeBSD.org>

Fix support for != 512 byte sector devices.
Restores the use of SBLOCK instead of the BSOFF/sectorsize calculation.
Using SBLOCK is bogus however in that it uses DEV_BSIZE instead of
the actual sector size, but that is taken care of in other places.
Changing the SBLOCK would be better, but it affects the system
in other places, and doing it this way makes it possible to
use filesystems that was made before the lite2 merge.


# 5ace3b26 08-Mar-1997 Mike Pritchard <mpp@FreeBSD.org>

Update a number of panic messages to reflect the actual name
of the routine that caused the panic.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 3f6f17ee 12-Nov-1996 Julian Elischer <julian@FreeBSD.org>

Submitted by: Archie and me.

We encountered an interesting situation where the superblock for
a file system got written to disk with the "fs_fmod" flag set to
one. It appears that this flag is normally supposed to be cleared
during ffs_sync(), but we experienced a crash, or some other weird
occurrence that left it on the disk set to 1.

Later this partition was mounted read-only... and the fs_fmod
field was never cleared, causing ffs_sync() to panic "rofs mod"
when trying to unmount that filesystem (ffs_vfsops.c: line 790).

fix:
set this bit to 0 when you load the superblock from disk.
(see more complete mail on this to hackers)


# c645dc12 07-Sep-1996 John Dyson <dyson@FreeBSD.org>

Fix a VOP_UNLOCK panic when using options DIAGNOSTIC during dismount.


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 2f9bae59 11-Jun-1996 David Greenman <dg@FreeBSD.org>

Moved the fsnode MALLOC to before the call to getnewvnode() so that the
process won't possibly block before filling in the fsnode pointer (v_data)
which might be dereferenced during a sync since the vnode is put on the
mnt_vnodelist by getnewvnode.

Pointed out by Matt Day <mday@artisoft.com>


# e1eec28a 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 847a3ba7 02-Mar-1996 John Dyson <dyson@FreeBSD.org>

Handle the bogus device that MFS uses as its VBLK device. We now don't
try to VMIO open it on MFS mounts. This will fix the mfs_badops
panic.


# 91477adc 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

Enable VMIO for non-VDIR metadata and block device.


# e6302eab 25-Feb-1996 Bruce Evans <bde@FreeBSD.org>

Removed vestigial support for the obsolete FIFO option. In ext2fs
it caused null pointer panics for all fifo operations unless FIFO
was defined.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 51ea8b57 14-Jan-1996 Bruce Evans <bde@FreeBSD.org>

Partially fixed negative and truncated "Avail" counts in df output.
This fixes PR943.

ffs/ffs_vfsops.c:
ffs_statfs() multiplied by (100 - minfree) as part of calculating the
minfree percentage (complemented in 100%), so with the standard minfree
of 8, it was broken for file systems of size >= 1TB/92 = 11GB. Use the
standard freespace() macro instead. This also fixes a rounding bug (the
"Avail" count was sometimes 1 too small).

ffs/* (not fixed):
The freespace() macro multiplies by minfree, so with the standard
minfree of 8, it is broken for file systems of size >= 1TB/8 = 128GB.
This bug is more serious since it affects block allocation.

ffs/ffs_alloc.c (not fixed):
Ordinary users are sometimes allowed to allocate 1 (partial) block
too many so that the "Avail" count goes negative. E.g., if there is
1 fragment available and the file is fairly large, one more full
block is allocated.

df/df.c:
ufs_df() used/uses essentially the same code as ffs_statfs(), so it
had/has the same bugs.

ufs_df() gratuitously replaced "Avail" counts of < 0 by 0, so it
gave different results for non-mounted file systems in this case.


# 01733a9b 05-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert QUOTA to new-style option.


# b8dce649 17-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# c03020b2 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Fix compiler warnings.


# 2b14f991 28-Aug-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


# 628641f8 11-Aug-1995 David Greenman <dg@FreeBSD.org>

Converted mountlist to a CIRCLEQ.

Partially obtained from: 4.4BSD-Lite2


# b6dedae6 06-Aug-1995 David Greenman <dg@FreeBSD.org>

Removed redundant call to vm_object_page_clean - this is already done
in vfs_msync().


# 8997d94f 21-Jul-1995 David Greenman <dg@FreeBSD.org>

Since ufs_ihashget can block, the lock must be checked for each time
the function returns. Also, moved lock into .bss and made minor cosmetic
changes.

Submitted by: Bruce Evans


# 2094ddb6 20-Jul-1995 David Greenman <dg@FreeBSD.org>

Implement a lock in ffs_vget to prevent a race condition where two processes
try allocate the same inode/vnode, causing a duplicate.

Submitted by: Matt Dillon, slightly reworked by me.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 2976b7f1 18-May-1995 David Greenman <dg@FreeBSD.org>

NFS diskless operation was broken because swapdev_vp wasn't initialized.
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().

Submitted by: Poul-Henning Kamp


# 1469eec8 15-May-1995 David Greenman <dg@FreeBSD.org>

Fixed incompleteness that would allow dirty filesystems to get mounted
when the single user shell was terminated. These changes disallow mounting
or R/W upgrading filesystems that are dirty unless "-f" (force) option
is used with mount. /etc/rc has been modified to abort the startup if
one or more non-nfs partitions fail to mount.

Reviewed by: Poul-Henning Kamp, Rod Grimes


# f33775af 01-May-1995 John Dyson <dyson@FreeBSD.org>

Limit filesize to the amount that the VM system can currently handle
(2GB). If this limit is not imposed, then filesystem corruption will
ensue when files larger than 2GB are created. This is temporary,
and the underlying limitation will be removed later.


# 81c6e3e5 10-Apr-1995 David Greenman <dg@FreeBSD.org>

Handle the "syncing VCHR vnode hang" problem a little differently; just
don't lock the vnode - it doesn't appear to ever be necessary for VCHR
vnode/inodes. This fixes a bug introduced in the previous commit that
caused tty timestamps to act strange (causing 'w' and 'finger' to show
the tty wasn't idle when it may have been for hours).


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 3aa12267 28-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.


# 4e83f749 18-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't sync the inode date changes of character special devices
during the FS sync. The system would appear to hang momentarily
if there was a large backlog of I/O. This is because the vnode
remains locked during the output - preventing normal character
I/O. The problem was exacerbated by the FFS contiguous block
allocation fixes and a semi-broken disksort(). The inode/date
will still be synced during a normal FS dismount and whenever
the inode is changed for other reasons.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 36633bf4 14-Nov-1994 Bruce Evans <bde@FreeBSD.org>

Undo a previous change. <sys/disklabel.h> was broken, not these files.


# 94a92413 27-Oct-1994 Jordan K. Hubbard <jkh@FreeBSD.org>

From: fredriks@mcs.com (Lars Fredriksen)
...
It turns out that these files do not include <sys/dkbad.h> before
<sys/disklabel.h>.
Submitted by: fredriks


# 901ba606 21-Oct-1994 David Greenman <dg@FreeBSD.org>

Restrict fs_maxfilesize to 2^40, and check against this in ffs_truncate().
This is part of a bug fix from Kirk McKusick to work around problems in FFS
related to the blkno of a 64bit offset not fitting into an int. Note the
proper solution would be to deal with 64bit block numbers, but doing this
would require sweeping changes; some other day perhaps.

Submitted by: Marshall Kirk McKusick


# c1d9efcb 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. make gcc less noisy. Still some way to go here.


# c9671602 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics for gcc -Wall. A couple of unused "int i"'s removed and a couple of
prototypes added. And the usual () work.


# 862cdb8e 21-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Call ffs ``ufs'' for the benefit of poor, confused user-land programs.


# c901836c 20-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


# e0e9c421 20-Aug-1994 David Greenman <dg@FreeBSD.org>

Implemented filesystem clean bit via:

machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.

cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.

vfs_conf.c:
Removed unused rootfs global.

vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.

ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.

ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().

vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources