History log of /freebsd-current/sys/ufs/ffs/ffs_snapshot.c
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 220427da 12-Aug-2023 Kirk McKusick <mckusick@FreeBSD.org>

Set UFS/FFS file type to snapshot before changing its block pointers.

A UFS/FFS snapshot file is identified with the SF_SNAPSHOT
flag to identify it as a snapshot. This flag needs to be
set before setting some of its block pointers to the special
values BLK_SNAP and BLK_NOCOPY. If the snapshot creation fails
and we call VOP_REMOVE(), the SF_SNAPSHOT flag will let the
remove routine know that the special block pointer values need
to be rolled back before attempting deletion of the file.

Also ensure that an fsck is required after setting superblock
values in the ffs_checkcgintegrity() routine.

Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# c52b5d16 09-Aug-2023 Kirk McKusick <mckusick@FreeBSD.org>

Remove a partial UFS/FFS snapshot if it fails to build successfully.

When taking a UFS/FFS snapshot, it may not succeed for example if the
filesystem is too full to hold it. When a snapshot is unable to be
successfully taken, the partial snapshot should be removed.

Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 831b1ff7 27-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

UFS/FFS: Migrate to modern uintXX_t from u_intXX_t.

As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.

No functional changes intended.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# ba8cc6d7 12-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use __enum_uint8 for vtype and vstate

This whacks hackery around only reading v_type once.

Bump __FreeBSD_version to 1400093


# f1549d7d 13-Jun-2023 Kirk McKusick <mckusick@FreeBSD.org>

Write out corrected superblock when creating a UFS/FFS snapshot.

When taking a snapshot on a UFS version 1 filesystem we need to
call ffs_oldfscompat_write() to unwind any in-memory changes that
were made to the superblock before writing it. The cause of this bug
was that the trimmed down maximum file size was not being reverted.

PR: 271352
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 78f41298 12-Nov-2022 Kirk McKusick <mckusick@FreeBSD.org>

Enable taking snapshots on UFS/FFS filesystems using journaled soft updates.

All the needed infrastructure updates have been made to allow
snapshots to be taken on UFS/FFS filesystems that are using journaled
soft updates. The most immediate benefit is the ability to use a
snapshot to take a consistent filesystem dump on a live filesystem
using the -L option to dump(8).

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36491


# 221da3e9 29-Sep-2022 Kirk McKusick <mckusick@FreeBSD.org>

Fix an incorrectly placed parenthesis.

While syntactically correct and even looking correct, it was definitely
not providing the desired result. And it has been this way for nearly
twenty years.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# c9dde6f0 26-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Fix unused variable warning in ffs_snapshot.c

With clang 15, the following -Werror warning is produced:

sys/ufs/ffs/ffs_snapshot.c:204:7: error: variable 'redo' set but not used [-Werror,-Wunused-but-set-variable]
long redo = 0, snaplistsize = 0;
^

The 'redo' variable is only used when DIAGNOSTIC is defined. Ensure it
is only declared and set in that case.

MFC after: 3 days


# 064e6b43 13-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Rewrite function definitions in the UFS/FFS code base with identifier lists.

The K&R style in UFS and other places in the tree's days are numbered
as this syntax is removed in C2x proposal N2432:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf

Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.

Requested by: Warner Losh
MFC after: 2 weeks


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 4559700a 22-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

ffs_snapblkfree(): add a comment explaining lockmgr invocation

Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# ddf162d1 28-Jan-2022 Kirk McKusick <mckusick@FreeBSD.org>

ufs: handle LoR between snap lock and vnode lock

When a filesystem is mounted all of its associated snapshots must
be activated. It first allocates a snapshot lock (snaplk) that will
be shared by all the snapshot vnodes associated with the filesystem.
As part of each snapshot file activation, it must replace its own
ufs vnode lock with the snaplk. In this way acquiring the snaplk
gives exclusive access to all the snapshots for the filesystem.

A write to a ufs vnode first acquires the ufs vnode lock for the
file to be written then acquires the snaplk. Once it has the snaplk,
it can check all the snapshots to see if any of them needs to make
a copy of the block that is about to be written. This ffs_copyonwrite()
code path establishes the ufs vnode followed by snaplk locking
order.

When a filesystem is unmounted it has to release all of its snapshot
vnodes. Part of doing the release is to revert the snapshot vnode
from using the snaplk to using its original vnode lock. While holding
the snaplk, the vnode lock has to be acquired, the vnode updated
to reference it, then the snaplk released. Acquiring the vnode lock
while holding the snaplk violates the ufs vnode then snaplk order.
Because the vnode lock is unused, using LK_EXCLUSIVE | LK_NOWAIT
to acquire it will always succeed and the LK_NOWAIT prevents the
reverse lock order from being recorded.

This change was made in January 2021 (173779b98f) to avoid an LOR
violation in ffs_snapshot_unmount(). The same LOR issue was recently
found again when removing a snapshot in ffs_snapremove() which must
also revert the snaplk to the original vnode lock as part of freeing it.

The unwind in ffs_snapremove() deals with the case in which the
snaplk is held as a recursive lock holding multiple references.
Specifically an equal number of references are made on the vnode
lock. This change factors out the lock reversion operations into a
new function revert_snaplock() which handles both the recursive
locks and avoids the LOR. The new revert_snaplock() function is
then used in both ffs_snapshot_unmount() and in ffs_snapremove().

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33946


# 324150d6 02-Jan-2022 Jessica Clarke <jrtc27@FreeBSD.org>

ufs: Avoid subobject overflow in snapshot expunge code

The code here tries to be smart and zeroes out both di_db and di_ib with
a single bzero call, thereby overrunning the di_db subobject. This is
fine on most architectures, if a little dodgy. However, on CHERI, the
compiler can optionally restrict the bounds on pointers to subobjects to
just that subobject, in order to mitigate intra-object buffer overflows,
and this is enabled in CheriBSD's pure-capability kernels.

Instead, use separate bzero calls for each array, and let the compiler
optimise it as it sees fit; even if it's not generating inline zeroing
code, Clang will happily optimise two consecutive bzero's to a single
larger call.

Reviewed by: mckusick
Differential Revision: https://reviews.freebsd.org/D33651


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# eede22d6 01-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_snapshot: do not assert that um_devvp is locked

It is not, and the lock is not needed there

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# 4a365e86 27-Sep-2021 Kirk McKusick <mckusick@FreeBSD.org>

Avoid "consumer not attached in g_io_request" panic when disk lost
while using a UFS snapshot.

The UFS filesystem supports snapshots. Each snapshot is a file whose
contents are a frozen image of the disk partition on which the filesystem
resides. Each time an existing block in the filesystem is modified,
the filesystem checks whether that block was in use at the time that
the snapshot was taken. If so, and if it has not already been copied,
a new block is allocated from among the blocks that were not in use
at the time that the snapshot was taken and placed in the snapshot file
to replace the entry that has not yet been copied. The previous contents
of the block are copied to the newly allocated snapshot file block,
and the write to the original is then allowed to proceed.

The block allocation is done using the usual UFS_BALLOC() routine
which allocates the needed block in the snapshot and returns a
buffer that is set up to write data into the newly allocated block.
In usual filesystem operation, the contents for the new block is
copied from user space into the buffer and the buffer is then written
to the file using bwrite(), bawrite(), or bdwrite(). In the case of a
snapshot the new block must be filled from the disk block that is about
to be rewritten. The snapshot routine has a function readblock() that
it uses to read the `about to be rewritten' disk block.

/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct bio *bip;
struct fs *fs;

ip = VTOI(snapvp);
fs = ITOFS(ip);

bip = g_alloc_bio();
bip->bio_cmd = BIO_READ;
bip->bio_offset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
bip->bio_data = bp->b_data;
bip->bio_length = bp->b_bcount;
bip->bio_done = NULL;

g_io_request(bip, ITODEVVP(ip)->v_bufobj.bo_private);
bp->b_error = biowait(bip, "snaprdb");
g_destroy_bio(bip);
return (bp->b_error);
}

When the underlying disk fails, its GEOM module is removed.
Subsequent attempts to access it should return the ENXIO error.
The functionality of checking for the lost disk and returning
ENXIO is handled by the g_vfs_strategy() routine:

void
g_vfs_strategy(struct bufobj *bo, struct buf *bp)
{
struct g_vfs_softc *sc;
struct g_consumer *cp;
struct bio *bip;

cp = bo->bo_private;
sc = cp->geom->softc;

/*
* If the provider has orphaned us, just return ENXIO.
*/
mtx_lock(&sc->sc_mtx);
if (sc->sc_orphaned || sc->sc_enxio_active) {
mtx_unlock(&sc->sc_mtx);
bp->b_error = ENXIO;
bp->b_ioflags |= BIO_ERROR;
bufdone(bp);
return;
}
sc->sc_active++;
mtx_unlock(&sc->sc_mtx);

bip = g_alloc_bio();
bip->bio_cmd = bp->b_iocmd;
bip->bio_offset = bp->b_iooffset;
bip->bio_length = bp->b_bcount;
bdata2bio(bp, bip);
if ((bp->b_flags & B_BARRIER) != 0) {
bip->bio_flags |= BIO_ORDERED;
bp->b_flags &= ~B_BARRIER;
}
if (bp->b_iocmd == BIO_SPEEDUP)
bip->bio_flags |= bp->b_ioflags;
bip->bio_done = g_vfs_done;
bip->bio_caller2 = bp;
g_io_request(bip, cp);
}

Only after checking that the device is present does it construct
the "bio" request and call g_io_request(). When readblock()
constructs its own "bio" request and calls g_io_request() directly
it panics with "consumer not attached in g_io_request" when the
underlying device no longer exists.

The fix is to have readblock() call g_vfs_strategy() rather than
constructing its own "bio" request:

/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct fs *fs;

ip = VTOI(snapvp);
fs = ITOFS(ip);

bp->b_iocmd = BIO_READ;
bp->b_iooffset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
bp->b_iodone = bdone;
g_vfs_strategy(&ITODEVVP(ip)->v_bufobj, bp);
bufwait(bp);
return (bp->b_error);
}

Here it uses the buffer that will eventually be written to the disk.
The g_vfs_strategy() routine uses four parts of the buffer: b_bcount,
b_iocmd, b_iooffset, and b_data.

The b_bcount field is already correctly set for the buffer. It is
safe to set the b_iocmd and b_iooffset fields as they are set
correctly when the later write is done. The write path will also
clear the B_DONE flag that our use of the buffer will set. The
b_iodone callback has to be set to bdone() which will do just
notification that the I/O is done in bufdone(). The rest of
bufdone() includes things like processing the softdeps associated
with the buffer should not be done until the buffer has been
written. Bufdone() will set b_iodone back to NULL after using it,
so the full bufdone() processing will be done when the buffer is
written. The final change from the previous version of readblock()
is that it used the b_data for the destination of the read while
g_vfs_strategy() uses the bdata2bio() function to take advantage
of VMIO when it is available.

Differential revision: https://reviews.freebsd.org/D32150
Reviewed by: kib, chs
MFC after: 1 week
Sponsored by: Netflix


# d7770a54 18-Sep-2021 Kirk McKusick <mckusick@FreeBSD.org>

Eliminate snaplk / bufwait LOR when creating UFS snapshots

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of creating a
new UFS snapshot, it has to have its individual vnode lock replaced
with the filesystem's snapshot lock.

The lock order for regular vnodes with respect to buffer locks is that
they must first acquire the vnode lock, then a buffer lock. The order
for the snapshot lock is reversed: a buffer lock must be acquired before
the snapshot lock.

When creating a new snapshot, the snapshot file must retain its vnode
lock until it has allocated all the blocks that it needs before
switching to the snapshot lock. This update moves one final piece of
the initial snapshot block allocation so that it is done before the
newly created snapshot is switched to use the snapshot lock.

Reported by: Witness code
MFC after: 1 week
Sponsored by: Netflix


# c31480a1 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

UFS snapshots: properly set the vm object size.

Citing Kirk:
The previous code [before 8563de2f2799b2cb -- kib] did not call
vnode_pager_setsize() but worked because later in ffs_snapshot() it
does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE()
allocated the extra block at the end of the file which caused it to do
the needed vnode_pager_setsize(). But the new code had already allocated
the extra block, so UFS_WRITE() did not extend the size and thus did not
do the vnode_pager_setsize().

PR: 253158
Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Reviewed by: mckusick
Tested by: cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 8563de2f 11-Feb-2021 Kirk McKusick <mckusick@FreeBSD.org>

Fix bug 253158 - Panic: snapacct_ufs2: bad block - mksnap_ffs(8) crash

The panic reported in 253158 arises because the /mnt/.snap/.factory
snapshot allocated the last block in the filesystem. The snapshot
code allocates the last block in the filesystem as a way of setting
its length to be the size of the filesystem. Part of taking a
snapshot is to remove all the earlier snapshots from the image of
the newest snapshot so that newer snapshots will not claim the blocks
of the earlier snapshots. The panic occurs when the new snapshot
finds that both it and an earlier snapshot claim the same block.

The fix is to set the size of the snapshot to be one block after
the last block in the filesystem. This block can never be allocated
since it is not a valid block in the filesystem. This extra block
is used as a place to store the initial list of blocks that the
snapshot has already copied and is used to avoid a deadlock in and
speed up the ffs_copyonwrite() function.

Reported by: Harald Schmalzbauer
Tested by: Peter Holm
PR: 253158
Sponsored by: Netflix


# b59a8e63 30-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Stop ignoring ERELOOKUP from VOP_INACTIVE()

When possible, relock the vnode and retry inactivation. Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# be44e986 24-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_snapshot: use VOP_VPUT_PAIR after VOP_CREATE.

If the snapshot embrio was reclaimed under us, return error outright.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 173779b9 15-Jan-2021 Kirk McKusick <mckusick@FreeBSD.org>

Eliminate lock order reversal in UFS when unmounting filesystems
with snapshots.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

The lock order reversal happens because vnode locks must be acquired
before snapshot locks. When unmounting we must lock both the snapshot
lock and the vnode lock before swapping them so that the vnode will
be continuously locked during the swap. For each vnode representing
a snapshot, we must first acquire the snapshot lock to ensure
exclusive access to it and its original lock. We then face a lock
order reversal when we try to acquire the original vnode lock. The
problem is eliminated by doing a non-blocking exclusive lock on the
original lock which will always succeed since there are no users
of that lock.

Sponsored by: Netflix


# ace3d947 23-Dec-2020 Mark Johnston <markj@FreeBSD.org>

ffs: Avoid out-of-bounds accesses in the fs_active bitmap

We use a bitmap to track which cylinder groups have changed between
snapshot creation and filesystem suspension. The "legs" of the bitmap
are four bytes wide (see ACTIVESET()) so we must round up the allocation
size to a multiple of four bytes.

I believe this bug is harmless since UMA/kmem_* will both pad the
allocation and zero the full allocation. Note that malloc() does inline
zeroing when the allocation size is known at compile-time.

Reported by: pho (using KASAN)
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27731


# 8a1509e4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Handle LoR in flush_pagedep_deps().

When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock. Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed. This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps(). The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed. All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP. For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied. Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes. [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged. A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive(). In this case, dropping the
vnode lock is not safe. Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT. ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 34816cb9 18-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Move the pointers stored in the superblock into a separate
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.

This cleanup of the superblock eases the task of the commit
that immediately follows this one.

Suggested by: brooks
Reviewed by: kib
PR: 246983
Sponsored by: Netflix


# 52488b51 04-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Further evaluation of the POSIX spec for fdatasync() shows that it
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.

This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.

Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072


# d93762b9 24-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop handling VI_OWEINACT in vget

vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.

Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23184


# ac4ec141 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add a setter for inode i_flag field

This will be used later to add vnodes to the lazy list.

Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994


# 8dbc6352 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop thread argument from vinactive


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 2ac044e6 27-Nov-2019 Chuck Silvers <chs@FreeBSD.org>

As part of creating a snapshot, set fs->fs_fmod to 0 in the snapshot image
because nothing ever changes this field for read-only mounts and we want
to verify that it is still 0 when we unmount.

Reviewed by: mckusick
Approved by: mckusick (mentor)
Sponsored by: Netflix


# 44d37182 03-Oct-2019 Kirk McKusick <mckusick@FreeBSD.org>

Update ffs_getcg() function to accept a flags parameter to be passed
to breadn_flags() in preparation for later need when doing forcible
unmount when disk dies or is removed.

No functional change.

Sponsored by: Netflix


# f3cf6225 06-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

ufs: Remove redundant brelse() after r294954

Same automation.

No functional change.


# f89d2072 17-Jun-2019 Xin LI <delphij@FreeBSD.org>

Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193


# af6aeacb 28-May-2019 Kirk McKusick <mckusick@FreeBSD.org>

Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS
as appropriate. No functional change intended.

Suggested-by: markj


# c0029546 27-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix

M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c


# 8690d4de 23-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Allocate v_object for the new snapshot vnode.

The vnode is not opened, so it ends up with the malloced buffers otherwise.

Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 8f829a5c 11-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# ade67b50 25-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

Calculate updated superblock check-hash before writing it into the snapshot.
This corrects a bug that prevented snapshots from being mounted due to a
superblock check-hash failure.

Reported by: Brennan Vincent <brennan@umanwizard.com>
Tested by: Peter Holm (pho@)
Sponsored by: Netflix


# 9fc5d538 13-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 7e038bc2 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix


# cc91864c 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.


# 68c49bcc 06-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix


# 6040822c 30-Jul-2018 Alan Somers <asomers@FreeBSD.org>

Make timespecadd(3) and friends public

The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725


# f9834d10 24-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

Revert r327781, r328093, r328056:
ufs|ext2fs: Revert uses of mallocarray(9).

These aren't really useful: drop them.
Variable unsigning will be brought again later.


# 90b618f3 17-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: use mallocarray(9).

Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.

Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by: mckusick


# 5f943cca 23-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Remove dead initialization of the inode pointer.

The pointer gets initialized again later in the code. This also
improves code style(9).


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.


# 75e3597a 21-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.

sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)

sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)

sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)


# 33bbdde0 31-Jul-2017 Kirk McKusick <mckusick@FreeBSD.org>

Avoid reading a snapshot block when it is already in the cache.
Update the use of the B_CACHE flag (since the May 1999 commit
that made it the correct test here).

Reported by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week


# 5cf14660 21-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Improve publication of the newly allocated snapdata.

For freshly allocated snapdata, Lock sn_lock in advance, so
si_snapdata readers see the locked snapdata and not race.

For existing snapdata, if the thread was put to sleep waiting for
sn_lock, re-read si_snapdata. This either closes the race or makes
the reliance on LK_DRAIN less important.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# c5364714 21-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Unlock correct lock in ffs_snapblkfree().

It is possible for ffs_snapblkfree() to race and lock snaplock while
the devvp snapdata is instantiated, but no snapshots exist. In this
case the loop over snapshots in ffs_snapblkfree() is not executed, and
the local variable vp is left initialized to NULL.

Unlock using &sn->sn_lock and not vp->v_vnlock. For the inodes on the
snapshot list, the locks are same.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# f2e6bf5c 21-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Account for lock recursion when transfering snaplock to the vnode lock
in ffs_snapremove().

Apparently ffs_snapremove() may be called with the snap lock recursed,
at least one trace demonstrated this when snapshot vnode was unlinked
while synced. It was inactivated from the syncer thread.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 9c4f551e 28-Jun-2017 Kirk McKusick <mckusick@FreeBSD.org>

Create a new function ffs_getcg() to read in and verify a cylinder
group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.

In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS

Reviewed by: kib


# 1dc349ab 15-Feb-2017 Ed Maste <emaste@FreeBSD.org>

prefix UFS symbols with UFS_ to reduce namespace pollution

Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by: kib, mckusick
Obtained from: NetBSD
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9536


# 8660b707 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.

Reviewed by: kib


# e1db6897 17-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Reduce size of ufs inode.

Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2. Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 411455a8 10-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month


# abafa4db 10-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: replace 0 with NULL for pointers.

While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by: mckusick
MFC after: 2 weeks


# 6c21f6ed 18-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4896af9f 01-Mar-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after: 3 days
Reviewed by: mckusick


# 2aea094f 12-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Only copy as much bytes as there in superblock, instead of the full
block copy, when copying the superblock into the snapshot. UFS1 does
not align superblock on the block boundary, and bcopy runs off the end
of the buffer.

Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cc3d8c35 09-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# ddd6b3fc 10-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# f99cb34c 01-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 91e94745 28-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD


# fc8fdae0 27-Sep-2012 Matthew D Fleming <mdf@FreeBSD.org>

Fix up kernel sources to be ready for a 64-bit ino_t.

Original code by: Gleb Kurtsou


# f7a3729c 22-Jul-2012 Kevin Lo <kevlo@FreeBSD.org>

Use NULL instead of 0 for pointers


# c52fd858 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 71469bb3 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# ecb6e528 11-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


# 75a58389 24-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


# 35338e60 01-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


# 86b57150 16-Jan-2012 Kirk McKusick <mckusick@FreeBSD.org>

There are several bugs/hangs when trying to take a snapshot on a UFS/FFS
filesystem running with journaled soft updates. Until these problems
have been tracked down, return ENOTSUPP when an attempt is made to
take a snapshot on a filesystem running with journaled soft updates.

MFC after: 2 weeks


# 8cd680f8 27-Sep-2011 Kirk McKusick <mckusick@FreeBSD.org>

This update eliminates a lock-order reversal warning discovered
whle tracking down the system hang reported in kern/160662 and
corrected in revision 225806. The LOR is not the cause of the system
hang and indeed cannot cause an actual deadlock. However, it can
be easily eliminated by defering the acquisition of a buflock until
after all the vnode locks have been acquired.

Reported by: Hans Ottevanger
PR: kern/160662


# 6b3b8a21 27-Sep-2011 Kirk McKusick <mckusick@FreeBSD.org>

This update eliminates the system hang reported in kern/160662 when
taking a snapshot on a filesystem running with journaled soft updates.

Reported by: Hans Ottevanger
Fix verified by: Hans Ottevanger
PR: kern/160662


# 9957ac07 18-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Fixed dereference of a NULL pointer.

Reported by: Peter Holm


# 43a3cc77 15-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by: Jeff Roberson


# 9eb8728a 12-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by: Jeff Roberson
Tested by: Peter Holm


# 3eb6e131 09-Feb-2011 Alexander Leidinger <netchild@FreeBSD.org>

Add some FEATURE macros for some UFS features.

SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES

Two minor changes where made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: kib
X-MFC after: to be determined in last commit with code from this project


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 8ef48de8 07-May-2010 Jeff Roberson <jeff@FreeBSD.org>

- Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
- Don't fsync() vnodes in prealloc if copy on write is in progress. It
is not safe to recurse back into the write path here.

Reported by: Vladimir Grebenschikov <vova@fbsd.ru>


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 7733cf8f 11-Feb-2010 Matt Jacob <mjacob@FreeBSD.org>

MFC a number of changes from head for ISP (203478,203463,203444,202418,201758,
201408,201325,200089,198822,197373,197372,197214,196162). Since one of those
changes was a semicolon cleanup from somebody else, this touches a lot more.


# c2ede4b3 07-Jan-2010 Martin Blapp <mbr@FreeBSD.org>

Remove extraneous semicolons, no functional changes.

Submitted by: Marc Balmer <marc@msys.ch>
MFC after: 1 week


# 885868cd 10-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


# 5bd65606 09-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after: 1 month


# f1c1cdbb 13-Nov-2008 Doug Ambrisko <ambrisko@FreeBSD.org>

For now on every 10 cyclinder groups flush the buffer cache to free
up space. If the buffer cache fills up then the disk systems can
grind to a halt. Better tuning can be figured out later.

Tested by: Tim, others and work
Reviewed by: Kostik Belousov
PR: 128832


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 4560452f 13-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

Sync up summary information for cylinder groups while data is already
in memory during snapshot creation. This improves the results of the
background fsck.

Submitted by: tegge
MFC after: 1 week


# 2814d5ba 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 57b4252e 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 9c0cdb82 31-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Don't free snapdata structures when they are no longer in use.
Keeping the lockmgr lock valid allows us to switch the v_lock pointer
in snapshot vnodes between the embedded lockmgr lock and snapdata
lock without needing the vnode interlock to protect against races
- Keep unused snapdata structures in a list.
- Add a function to lock the devvp and allocate a snapdata to it or
acquire a new one without races. The old function was safe from
creation races because we set the mount flag when creating snapshots
and thus serializing them. However, it might have been subject to
destroying races.

Reviewed by: tegge


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 0e9eb108 23-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 1102b89b 08-Nov-2007 David E. O'Brien <obrien@FreeBSD.org>

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 982d11f8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 5b959aa4 10-Apr-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix the NAMEI zone leak when snapshot was successfully created.

Reported and tested by: Peter Holm
MFC after: 2 weeks


# 04533fc6 04-Apr-2007 Xin LI <delphij@FreeBSD.org>

Use *_EMPTY macros when appropriate.


# 2cc7d26f 23-Jan-2007 Konstantin Belousov <kib@FreeBSD.org>

Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)


# db9b81ea 20-Jan-2007 Mike Pritchard <mpp@FreeBSD.org>

Quota system cleanup.

1) Do not do quota accounting for the actual quota data files
or for file system snapshot files ("system" files). This
prevents a deadlock descibed in PR kern/30958 if the kernel
ever has to grow the quota file. Snapshot files were already
exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly
write the quota information to the data file at a truncated
value for a uint_t32 id value. The incorrect cast caused quota
files in this case to be around 4GB in size, with the correct cast
they can now be 131GB in size. Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account
for them. This prevents the quota files from becoming 131GB in
size and causing quotacheck to run forever at bootup. This could
also cause the kernel to try and expand the quota file, which might
deadlock due to the issue in #1. kern/30958 and kern/38156
(and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the
size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an
int, like it used to be before the common routine was split up
into 2 different routines to increase / decrease the i-node in-use
count. Prevents an underflow on the i-node count. Related
to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is
full and the write was denied due to that fact. PR kern/89247.

Some of these changes require an updated quotacheck to prevent
the creation of huge (131GB) quota data files (item #3).

#1/#4 probably fixes a lot of the random hangs when quotas are enabled,
possibly some of the jail hangs.


# ec7a247a 10-Oct-2006 Konstantin Belousov <kib@FreeBSD.org>

Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.

Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.

Tested by: Peter Holm
Reviewed by: tegge, bde
Approved by: pjd (mentor)
MFC after: 3 weeks


# 9b65c22c 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Don't restore MNT_QUOTA bit in mnt_flag after snapshot creation,
closing a race between nmount() and quotactl().


# 5da56ddb 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# 3f65847e 21-Aug-2006 Konstantin Belousov <kib@FreeBSD.org>

While checking for update of snapshot file in the ffs_copyonwrite,
first filter out metadata update. Otherwise, devfs vnode could be
erronously interpreted as ufs one, causing further check of i_flags
to use random memory.

PR: kern/100365
Debugged and fix described by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks


# e0cf7175 15-May-2006 Tor Egge <tegge@FreeBSD.org>

Read block hints list from last snapshot on the active snapshot list.


# d93d98d9 15-May-2006 Tor Egge <tegge@FreeBSD.org>

Copy last block on file system again after file system has been suspended.

Obtained from: NetBSD


# ae5d9f3b 15-May-2006 Tor Egge <tegge@FreeBSD.org>

Don't leak a locked buffer if last block on file system cannot be read.


# ebb78f64 15-May-2006 Tor Egge <tegge@FreeBSD.org>

Errors detected while file system is suspended should not trigger an
assertion failure.


# b405cb5e 13-May-2006 Tor Egge <tegge@FreeBSD.org>

Expunge traces of unlinked snapshot files when making a new snapshot.


# 0911ecff 05-May-2006 Tor Egge <tegge@FreeBSD.org>

Turn off disk quotas for snapshot files.


# 5b139b2d 05-May-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Set bio_done directly to NULL to indicate that we want to wait for the bio.
- Use biowait() instead of copying the code.

MFC after: 1 month


# d81daf63 02-May-2006 Tor Egge <tegge@FreeBSD.org>

Detect the snapshot file being prematurely unlinked.


# 5515ad42 02-May-2006 Tor Egge <tegge@FreeBSD.org>

A side effect of calling runningbufwakeup() is that bp->b_runningbufspace is
cleared. Save old value and restore bp->b_runningbufspace before returning
from ffs_copyonwrite().


# 6d94935d 02-May-2006 Tor Egge <tegge@FreeBSD.org>

Close a race when VOP_LOCK() on a snapshot file is attempted at the
same time as it is changed back into a normal file. The locker would
get the shared "snaplk" lock which would no longer be the correct lock
for the vnode.


# 3bbd6d8a 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


# 8c86028f 19-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Ensure that vnode for directory isn't reclaimed before ffs_snapshot() has
completed expunging unlinked files. It could come back at another memory
location causing a lock order reversal.


# 8db35720 11-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.

Pointed out by: tegge


# ca2fa807 10-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Block secondary writes while expunging active unlinked files.

Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.

Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).

Perform deferred inactive handling after the file system is resumed.


# eb2ea105 01-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 82be0a5a 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# 5c65ae3a 05-Jan-2006 Warner Losh <imp@FreeBSD.org>

New option: NO_FFS_SNAPSHOT. I did this in p4 about the same time
that NetBSD implemented it independently of them (don't know which one
was actually first). This saves about 24k for those times you don't
need snapshot support (like when running off a ram disk, or in an
embedded environment where size matters).


# 021869b5 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Reduce probability for a deadlock that can occur when a snapshot inode is
updated by a process holding the snapshot lock. Another process updating a
different inode in the same inodeblock will do copy on write checks and lock in
the opposite direction.

The snapshot code force a copy on write of these blocks manually (cf. start of
expunge_ufs[12]) and these inode blocks are later put on snapblklist.

This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after
installing new snapblklist. This is not a 100% solution since a failed block
allocation can cause implicit fsync() which might deadlock before the new
snapblklist has been installed.


# d4d530da 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Eliminate a deadlock that can occur when a dirty block belonging to a snapshot
file is flushed by a process not holding snaplk (e.g. bufdaemon). Another
process might hold snaplk and try to access the block due to ffs_copyonwrite
processing.


# 45f91051 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Eliminate a deadlock that can occur during the cgaccount() processing due to
the cg map buffer being held when writing indirect blocks. The process ends up
in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer
lock.

Another process might be in ffs_copyonwrite(), trying to allocate a new block
for a copy. It would hold snaplk while trying to get the cg map buffer lock.

Release the cg map buffer early and use the copy for most of the cgaccount
processing to avoid this deadlock.


# 17026ff6 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Reduce the probability of low block numbers passed to ffs_snapblkfree() by
skipping the call from ffs_snapremove() if the block number is zero.

Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using
the same locking protocol for low block numbers as for larger block numbers.
This removes a lock leak that could happen if vn_lock() succeeded after
lockmgr() failed in ffs_snapblkfree().

Check if snapshot is gone before retrying a lock in ffs_copyonwrite().


# 460858e9 01-Oct-2005 Don Lewis <truckman@FreeBSD.org>

Correct previous commit to fix the sense of the TDP_NORUNNINGBUF
check in ffs_copyonwrite() that is a precondition for calling
waitrunningbufspace().

Pointed out by: tegge
Pointy hat to: truckman
MFC after: 3 days


# bd3c2d86 30-Sep-2005 Don Lewis <truckman@FreeBSD.org>

Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the threads previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().


# 6c8b634f 29-Sep-2005 Don Lewis <truckman@FreeBSD.org>

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


# bcc8f66c 02-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Use M_ZERO rather than explicitly calling bzero().
- Don't intermingle direct calls to lockmgr and indirect calls through
VOPs. This will be important in the future.
- Dont lock the devvp's interlock just to release it on the next line by
passing LK_INTERLOCK to lockmgr.
- Restructure ffs_snapshot_unmount so we don't call free() with the
devvp's interlock locked.


# ec3db02a 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Set LK_NOSHARE for snapshot locks. snapshots require exclusive only
access.
- Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs
specific way.

Sponsored by: Isilon Systems, Inc.


# f247a524 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# fdcc8227 12-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


# 41766826 01-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Fix anoter dyslexic moment; an atomic_set_int should've become ACTIVESET,
not ACTIVECLEAR.

Submitted by: iedowse


# d5128ab2 19-Feb-2005 Xin LI <delphij@FreeBSD.org>

When clearing a fragment, it's possible that the length is zero.

Reviewed by: mckusick
MFC After: 1 week


# efd6d980 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use the UFS_* and VFS_* functions where a direct call is possble.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 40854ff5 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 5cef9d6a 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Use the ufs lock to protect fs_active.

Sponsored By: Isilon Systems, Inc.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 364ed814 09-Dec-2004 Kirk McKusick <mckusick@FreeBSD.org>

Fixes a bug that caused UFS2 filesystems bigger than 2TB to
prematurely report that they were full and/or to panic the kernel
with the message ``ffs_clusteralloc: allocated out of group''.

Submitted by: Henry Whincup <henry@jot.to>
MFC after: 1 week


# 8f25bad3 08-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Fix snapshot creation.


# 43920011 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


# fae974f1 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Degeneralize the per cdev copyonwrite callback. The only possible value
is ffs_copyonwrite() and the only place it can be called from is FFS which
would never want to call another filesystems copyonwrite method, should one
exist, so there is no reason why anything generic should know about this.


# b08c753b 16-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do not traverse list of snapshots if there isn't one.

Found by: scottl


# b85e29f0 16-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Missed a place where snapshots were allocated in my last commit to
this file.


# 67673e66 13-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Create struct snapdata which contains the snapshot fields from cdev
and the previously malloc'ed snapshot lock.

Malloc struct snapdata instead of just the lock.

Replace snapshot fields in cdev with pointer to snapdata (saves 16 bytes).

While here, give the private readblock() function a vnode argument
in preparation for moving UFS to access GEOM directly.


# b403319b 28-Jul-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# e3c5a7a4 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.


# 86030e4a 18-Jun-2004 Jun Kuriyama <kuriyama@FreeBSD.org>

Avoid deadlock which is caused by locking VDIR of parent and VREG of
snapshot itself in wrong order.
We can skip unlink check of that directory because it must have
snapshot in it.

Reviewed by: mckusick and current@


# fa885116 15-Jun-2004 Julian Elischer <julian@FreeBSD.org>

Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 1a5ff928 08-Jun-2004 Stefan Farfeleder <stefanf@FreeBSD.org>

Avoid assignments to cast expressions.

Reviewed by: md5
Approved by: das (mentor)


# df1941fb 12-Feb-2004 Jun Kuriyama <kuriyama@FreeBSD.org>

Fix style bugs in previous commit.

Submitted by: bde


# 5580f04a 12-Feb-2004 Jun Kuriyama <kuriyama@FreeBSD.org>

Reverse lock order by using local variable. This will shut up "acquiring
duplicate lock of same type" message.

Reviewed by: mckusick


# 291027ce 03-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid calling vprint on a vnode while holding its interlock mutex.
Move diagnostic printf after vget. This might delay the debug
output some, but at least it keeps kernel from exploding if
DEBUG_VFS_LOCKS is in effect.


# c78b8dfa 12-Nov-2003 Alan Cox <alc@FreeBSD.org>

Call free(9) after the vnode interlock is released, avoiding a lock-order
reversal.


# ca430f2e 04-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 787f162d 23-Oct-2003 John Baldwin <jhb@FreeBSD.org>

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


# bd189c8c 17-Oct-2003 Kirk McKusick <mckusick@FreeBSD.org>

When expunging unlinked files from a snapshot, skip over holes in the
file rather than panicing with "indiracct: botched params".

Submitted by: Mark Santcroos <marks@ripe.net>


# 53938b4a 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Skip over xvp if XLOCK is set.


# 934914d2 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix an unlocked call to GETATTR by slightly shuffling the code in
ffs_snapshot() around.
- Acquire the interlock before releasing the mntvnode_mtx. Use the
interlock to protect v_usecount access.


# f4636c59 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 51da11a2 29-Apr-2003 Mark Murray <markm@FreeBSD.org>

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# a15cc359 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Lock both the proc lock and sched_lock when calling sched_nice since
kg_nice is now protected by both. Being protected by both means that
other places in the kernel that want to read kg_nice only need one of the
two locks.


# 86711bae 11-Apr-2003 Jeff Roberson <jeff@FreeBSD.org>

- Use the sched_nice() api instead of setting the nice value directly.

Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>


# 31566c96 20-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Use td->td_ucred instead of td->td_proc->p_ucred.


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 34968037 07-Mar-2003 Kirk McKusick <mckusick@FreeBSD.org>

Use the appropriate size when zeroing out the unused portion
of a snapshot's copy of a superblock. This patch fixes a panic
when taking a snapshot of a 4096/512 filesystem.

Reported by: Ian Freislich <ianf@za.uu.net>
Sponsored by: DARPA & NAI Labs.


# 7261f5f6 03-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# 5bb651cb 21-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a deadlock between the bufdaemon and a process taking
a snapshot. As part of taking a snapshot of a filesystem, the kernel
builds up a list of the filesystem metadata (such as the cylinder
group bitmaps) that are contained in the snapshot. When doing a
copy-on-write check, the list is first consulted. If the block being
written is found on the list, then the full snapshot lookup can be
avoided. Besides providing an important performance speedup this
check also avoids a potential deadlock between the code creating
the snapshot and the bufdaemon trying to cleanup snapshot related
buffers. This fix creates a temporary list containing the key
metadata blocks that can cause the deadlock. This temporary list
is used between the time that the snapshot is first enabled and the
time that the fully complete list is built.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 37e2ebfd 21-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a bug on an active filesystem on which a snapshot
is being taken from panicing with either "freeing free block" or
"freeing free inode". The problem arises when the snapshot code
is scanning the filesystem looking for inodes with a reference
count of zero (e.g., unlinked but still open) so that it can
expunge them from its view. If it encounters a reclaimed vnode
and has to restart its scan, then it will panic if it encounters
and tries to free an inode that it has already processed. The fix
is to check each candidate inode to see if it has already been
processed before trying to delete it from the snapshot image.

Sponsored by: DARPA & NAI Labs.


# d60682c2 21-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a bug in the logical block calculation macros so
that they convert to 64-bit values before shifting rather than
afterwards. Once fixed, they can be used rather than inline expanded.

Sponsored by: DARPA & NAI Labs.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 4c572f62 18-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

Fix corruption introduced in previous delta.

Reported by: Aurelien Nephtali <aurelien.nephtali@wanadoo.fr>
Sponsored by: DARPA & NAI Labs.


# 6d967351 18-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

Keep comments consistent with the code. Minor optimization.

Sponsored by: DARPA & NAI Labs.


# 8efcd9a7 15-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

Update to previous change (1.54) to use an approperly wide inode field
so as to work correctly on 64-bit platforms.

Reported-by: Jake Burkholder <jake@locore.ca>
Sponsored by: DARPA & NAI Labs.
Approved by: Ian Dowse <iedowse@maths.tcd.ie>


# 0db138a6 13-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

Only the most recent snapshot contains the complete list of blocks
that were copied in all of the earlier snapshots, thus its precomputed
list must be used in the copyonwrite test. Using incomplete lists may
lead to deadlock. Also do not include the blocks used for the indirect
pointers in the indirect pointers as this may lead to inconsistent
snapshots.

Sponsored by: DARPA & NAI Labs.
Approved by: re


# 0cb652d9 03-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

Have to use bread() rather than UFS_BALLOC() when obtaining a
previously allocated block as the previous use of the block may
have fallen out of the cache. Failure to reread its contents cause
zeroed results to be written instead of the proper contents.
Conversely, when the block is going to be entirely filled in, it
is not necessary reread the old contents.

Sponsored by: DARPA & NAI Labs.
Approved by: re


# c6964d3b 30-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Remove a race condition / deadlock from snapshots. When
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.

Sponsored by: DARPA & NAI Labs.


# 63cf5b0e 30-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Fix two deadlocks in snapshots:

1) Release the snapshot file lock while suspending the system. Otherwise
a process trying to read the lock may block on its containing directory
preventing the suspension from completing. Thanks to Sean Kelly
<smkelly@zombie.org> for finding this deadlock.

2) Replace some bdwrite's with bawrite's so as not to fill all the
buffers with dirty data. The buffers could not be cleaned as the
snapshot vnode was locked hence the system could deadlock when
making snapshots of really massive filesystems. Thanks to
Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp> for figuring
this out.

Sponsored by: DARPA & NAI Labs.


# ada981b2 26-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Create a new 32-bit fs_flags word in the superblock. Add code to move
the old 8-bit fs_old_flags to the new location the first time that the
filesystem is mounted by a new kernel. One of the unused flags in
fs_old_flags is used to indicate that the flags have been moved.
Leave the fs_old_flags word intact so that it will work properly if
used on an old kernel.

Change the fs_sblockloc superblock location field to be in units
of bytes instead of in units of filesystem fragments. The old units
did not work properly when the fragment size exceeeded the superblock
size (8192). Update old fs_sblockloc values at the same time that
the flags are moved.

Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk>
Sponsored by: DARPA & NAI Labs.


# cdf5e9cc 15-Nov-2002 Peter Wemm <peter@FreeBSD.org>

Do not assume that time_t is an int.

Approved by: re (jhb)


# 9ab73fd1 24-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 0152387a 21-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

This update further fine tunes the locking of snapshot vnodes in
the ffs_copyonwrite routine to avoid a deadlock between the syncer
daemon trying to sync out a snapshot vnode and the bufdaemon
trying to write out a buffer containing the snapshot inode.
With any luck this will be the last snapshot race condition.

Sponsored by: DARPA & NAI Labs.


# e03486d1 21-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

This checkin reimplements the io-request priority hack in a way
that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):

----------------------------
revision 1.65
date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.

The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.

Also, curthread may or may not have anything to do with the I/O request
at hand.

The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.

Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.

The sleep also needs to be in constant timeunits, 1/hz can be practicaly
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------

As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.

On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>


# 86aeb27f 15-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Change locking so that all snapshots on a particular filesystem share
a common lock. This change avoids a deadlock between snapshots when
separate requests cause them to deadlock checking each other for a
need to copy blocks that are close enough together that they fall
into the same indirect block. Although I had anticipated a slowdown
from contention for the single lock, my filesystem benchmarks show
no measurable change in throughput on a uniprocessor system with
three active snapshots. I conjecture that this result is because
every copy-on-write fault must check all the active snapshots, so
the process was inherently serial already. This change removes the
last of the deadlocks of which I am aware in snapshots.

Sponsored by: DARPA & NAI Labs.


# cba63e02 08-Oct-2002 Maxime Henrion <mux@FreeBSD.org>

Fix build of 64 bit platforms.


# 4d533db1 09-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

When creating a snapshot, create a list of initially allocated blocks.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.

Sponsored by: DARPA & NAI Labs.


# a2c4ff97 08-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot().

Pointy hat to: me
Found by: green


# 6ef17634 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Document broken locking.
- Use vrefcnt().


# cf09d674 20-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

We don't need to #include <sys/disklabel.h>.
We don't need to #include <sys/disklabel.h> second time either.

Sponsored by: DARPA & NAI Labs.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# cfbf0a46 23-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by: mckusick


# 10cfbc19 23-Jun-2002 Matthew Dillon <dillon@FreeBSD.org>

Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by: mckusick


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 7110af75 12-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

ARGH! SBLOCK is not unused. Try to get this right.

BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).

Define SBLOCK again, using the right math.

Sponsored by: DARPA & NAI Labs.


# 6f1e8551 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 367b50a2 18-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those). Reformat
the printfs if necessary to avoid new long lones or old format printf
errors.


# a0595d02 16-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


# 0d2af521 15-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# fdcc1cc0 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 2c100766 11-Feb-2002 Julian Elischer <julian@FreeBSD.org>

In a threaded world, differnt priorirites become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# c9f96392 01-Feb-2002 Kirk McKusick <mckusick@FreeBSD.org>

When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.


# 99bef878 17-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.


# f305c5d1 18-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by: Jake Burkholder <jake@locore.ca>


# cc5a9233 13-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 7389126d 14-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.


# 9b35c30c 11-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Remove yet another deadlock case.


# 27b047ac 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
This fixes a panic when taking a snapshot on a filesystem with
a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
superblock summary information. This fixes a bug with inconsistent
snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
calculate the number of blocks that need to be checked. This fixes
a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
to another snapshot, properly account for the reduced number of
blocks in the snapshot from which they are taken. This fixes a
bug in which the number of blocks released from a snapshot did not
match the number that it claimed to have.


# 23371b2f 03-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.


# 855aa097 28-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# c9509f58 25-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

Rather than copying all the indirect blocks of the snapshot,
simply mark them as BLK_NOCOPY. This trick cuts the initial
size of the snapshot in half and cuts the time to take a
snapshot by a third.


# 112f7372 25-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 1a6a6610 13-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */


# 31c6ce0a 20-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.


# 589c7af9 07-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).


# d5a08a60 11-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# f55ff3f3 15-Jan-2001 Ian Dowse <iedowse@FreeBSD.org>

The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by: mckusick


# cb3ab5aa 12-Jan-2001 Kirk McKusick <mckusick@FreeBSD.org>

Properly compute the size of the final block of superblock summary information.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


# 48d61748 18-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Several small but important fixes for snapshots:

1) Be more tolerant of missing snapshot files by only trying to decrement
their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
(from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
that need to be copied (from Don Coleman <coleman@coleman.org>).


# 8461bdba 17-Sep-2000 Dag-Erling Smørgrav <des@FreeBSD.org>

Silence a warning.


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 3592b715 26-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# d303f71f 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Brain fault, forgot to update ffs_snapshot.c with the new calling convention
for vn_start_write.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).