History log of /freebsd-current/sys/ufs/ffs/ffs_softdep.c
Revision Date Author Comments
# 35a30155 03-Dec-2023 Kirk McKusick <mckusick@FreeBSD.org>

Increase UFS/FFS maximum link count from 32767 to 65530.

The link count for a UFS/FFS inode is stored in a signed 16-bit
integer. Thus the maximum link count has been 32767.

This limit has been recently hit by the poudriere build system when
doing a ports build as it needs one directory per port and the
number of ports recently passed 32767.

A long-term solution would be to use one of the spare 32-bit fields
in the inode to store the link count. However, the UFS1 format does
not have a spare and adding the spare in UFS2 would make it hard
to make it compatible when running on older kernels that use the
original link count field. So this patch uses the much simpler
approach of changing the existing link count field from a signed
16-bit value to an unsigned 16-bit value. It has the fewest lines
of code changes. The only thing that changes is the type in the
dinode and inode structures and the definition of UFS_LINK_MAX. It
has the added benefit that it works with both UFS1 and UFS2.

It allows easy backward compatibility. Indeed it is backward
compatibility that is the primary reason to go with this approach.
If a filesystem with the new organization is mounted on an older
kernel, it still needs to work. Thus if we move the new link count
to a new field, we still need to maintain the old link count as
best as possible even when running on a kernel that knows about the
larger link counts. And we would have to carry this overhead for
the indefinite future.

If we have a new link-count field, we will have to add a new
filesystem flag to indicate that we are running with larger link
counts. We will also need to add one of the new-feature flags
to say that we have larger link counts. Older kernels clear the
new-feature flags that they do not know about, so when a filesystem
is used on an older kernel and then moved back to a newer one, the
newer one will know that the new link counts have not been maintained
and that it will be necessary to run a full fsck on the filesystem
to correct the link counts before it can be mounted.

With this change, older kernels will generally work with the bigger
counts. While an older kernel will not itself allow the link count to exceed
32767, it will have no problem working with inodes that have a link
count greater than 32767. Since it tests that i_nlink <= UFS_LINK_MAX,
counts that are bigger than 32767 will appear negative, so will
still pass the test. Of course, if they ever drop below 32767, they
will no longer be able to exceed 32767. The one issue is if the
link count ever exceeds 65535 then it will wrap to zero and the
older kernel will be none the wiser. But this corner case is likely
to be very rare since these kernels and the applications running
on them do not expect to be able to get link counts over 32767. And
over time, the use of new filesystems on older kernels will become
rarer and rarer.
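
The signed-versus-unsigned behaviour described above can be sketched in
user-space C. This is only an illustration, not kernel code: OLD_LINK_MAX,
old_kernel_view(), old_kernel_passes_check() and bump_link_count() are
hypothetical names, and the check is assumed to be a plain "<=" comparison
as the text says.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define	OLD_LINK_MAX	32767	/* limit assumed for an older kernel */

/* Reinterpret the raw 16-bit on-disk field the way an older kernel would. */
static int16_t
old_kernel_view(uint16_t on_disk)
{
	int16_t v;

	memcpy(&v, &on_disk, sizeof(v));	/* same bits, signed type */
	return (v);
}

/* The older kernel's limit test: i_nlink <= UFS_LINK_MAX. */
static bool
old_kernel_passes_check(uint16_t on_disk)
{
	return (old_kernel_view(on_disk) <= OLD_LINK_MAX);
}

/* Incrementing a 16-bit count past 65535 wraps to zero. */
static uint16_t
bump_link_count(uint16_t n)
{
	return ((uint16_t)(n + 1));
}
```

A count of 40000 written by a newer kernel reads as a negative number
through the signed view, so it still passes the older kernel's test.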

Reported-by: Mark Millard running poudriere on the ports tree
Reviewed-by: kib, olce.freebsd_certner.fr
Tested-by: Peter Holm, Mark Millard
MFC-after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D42767


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# d4a8f5bf 07-Aug-2023 Kirk McKusick <mckusick@FreeBSD.org>

Handle UFS/FFS file deletion from cylinder groups with check-hash failure.

When a file is deleted, its blocks need to be put back in the free
block list and its inode needs to be put back in the inode free list.
These lists reside in cylinder-group maps. If either some of its blocks
or its inode reside in a cylinder-group map with a bad check hash
it is not possible to free the associated resource. Since the cylinder
group cannot be repaired until the filesystem is unmounted these
resources cannot be freed. They simply accumulate in memory. And
any attempt to unmount the filesystem loops forever trying to flush them.

With this change, the resource update claims to succeed so that the
file deletion can successfully complete. The filesystem is marked as
requiring an fsck so that before the next time that the filesystem is
mounted, the offending cylinder groups are reconstructed causing the
lost resources to be reclaimed.

A better solution would be to downgrade the filesystem to read-only,
but that capability is not currently implemented.

Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 831b1ff7 27-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

UFS/FFS: Migrate to modern uintXX_t from u_intXX_t.

As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.

No functional changes intended.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# bb24eaea 05-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_lock_pair(): allow to request shared locking

If either of vnodes is shared locked, lock must not be recursed.

Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444


# fe5e6e2c 29-Mar-2023 Kirk McKusick <mckusick@FreeBSD.org>

Improvement in UFS/FFS directory placement when doing mkdir(2).

The algorithm for laying out new directories was devised in the 1980s
and markedly improved the performance of the filesystem. In those days
large disks had at most 100 cylinder groups and often as few as 10-20.
Modern multi-terabyte disks have thousands of cylinder groups. The
original algorithm does not handle these large sizes well. This change
attempts to expand the scope of the original algorithm to work well
with these much larger disks while still retaining the properties
of the original algorithm for small disks.

The filesystem implementation is divided into policy routines and
implementation routines. The policy routines can be changed in any
way desired without risk of corrupting the filesystem. The policy
requests are handled by the implementation layer. If the policy
asks for an available resource, it is granted. But if it asks for
an already in-use resource, then the implementation will provide
an available one nearby the request. Thus it is impossible for a
policy to double allocate. This change is limited to the policy
implementation.
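
The policy/implementation split described above can be sketched as follows.
This is a simplified stand-in, not the FFS allocator: alloc_near() and its
boolean in-use map are hypothetical, and "nearby" is taken to mean the
closest free slot by index.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical implementation-layer allocator: grant the preferred slot
 * if it is free, otherwise the closest free slot.  Returns the slot that
 * was allocated, or -1 if the request is invalid or nothing is free.
 */
static int
alloc_near(bool *in_use, size_t nslots, size_t pref)
{
	size_t dist;

	if (pref >= nslots)
		return (-1);
	for (dist = 0; dist < nslots; dist++) {
		/* Look above the preference first, then below it. */
		if (pref + dist < nslots && !in_use[pref + dist]) {
			in_use[pref + dist] = true;
			return ((int)(pref + dist));
		}
		if (dist <= pref && !in_use[pref - dist]) {
			in_use[pref - dist] = true;
			return ((int)(pref - dist));
		}
	}
	return (-1);
}
```

Whatever slot the policy asks for, the function only ever hands out a slot
that was free, so a poor policy can cost performance but cannot double
allocate.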

This change updates the ffs_dirpref() routine which is responsible
for selecting the cylinder group into which a new directory should
be placed. If we are near the root of the filesystem we aim to
spread them out as much as possible. As we descend deeper from the
root we cluster them closer together around their parent as we
expect them to be more closely interactive. Higher-level directories
like usr/src/sys and usr/src/bin should be separated while the
directories in these areas are more likely to be accessed together
so should be closer. And directories within commands or kernel
subsystems should be closer still.

We pick a range of cylinder groups around the cylinder group of the
directory in which we are being created. The size of the range for
our search is based on our depth from the root of our filesystem.
We then probe that range based on how many directories are already
present. The first new directory is at 1/2 (middle) of the range;
the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8,
5/8, 7/8, 1/16, 3/16, 5/16, etc.
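
The probe order listed above can be reproduced with a few lines of C. This
is a sketch of the sequence only, not the code in ffs_dirpref();
probe_fraction() is a hypothetical helper, and n is assumed to be at least 1
and small enough that the arithmetic does not overflow.

```c
/*
 * Position of the n-th new directory (n >= 1) within the search range,
 * expressed as the fraction *num / *den.
 * Sequence: 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, ...
 */
static void
probe_fraction(unsigned n, unsigned *num, unsigned *den)
{
	unsigned base = 1;		/* largest power of two <= n */

	while (base * 2 <= n)
		base *= 2;
	*den = base * 2;
	*num = 2 * (n - base) + 1;	/* odd numerators only */
}
```

Each time n reaches a power of two the denominator doubles, so later
directories fill the gaps left between earlier ones.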

It is desirable to store the depth of a directory in its on-disk
inode so that it is available when we need it. We add a new field
di_dirdepth to track the depth of each directory. Because there are
few spare fields left in the inode, we choose to share an existing
field in the inode rather than having one of our own. Specifically
we create a union with the di_freelink field. The di_freelink field
is used to track inodes that have been unlinked but remain referenced.
It is not needed until a rmdir(2) operation has been done on a
directory. At that point, the directory has no contents and even
if it is kept active as a current directory is no longer able to
have any new directories or files created in it. Thus the use of
di_dirdepth and di_freelink will never coincide.
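
A minimal sketch of sharing one field through a union, as described above.
The structure is a heavily trimmed, hypothetical stand-in for the on-disk
inode; the field widths and the di_shared name are assumptions made for
illustration.

```c
#include <stdint.h>

/* Hypothetical, heavily trimmed stand-in for the on-disk inode. */
struct sketch_dinode {
	uint16_t di_mode;
	union {
		uint32_t dirdepth;	/* live directory: depth from the root */
		uint32_t freelink;	/* unlinked inode: next unlinked inode */
	} di_shared;
};

/* Convenience names so callers can keep using the familiar spellings. */
#define	di_dirdepth	di_shared.dirdepth
#define	di_freelink	di_shared.freelink
```

Both names refer to the same storage, so the inode does not grow; the
scheme is safe only because the two uses never overlap in time.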

Reported by: Timo Voelker
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39246


# 6e1eabad 07-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

ffs_syncvnode(): avoid a LoR for SU

There is another case where SU code does ffs_syncvnode(dvp) for the
parent directory dvp while the child vnode vp is locked. Avoid the
issue by relocking and returning ERELOOKUP to indicate the need of
resync.

Reported by: jkim
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37997


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# 064e6b43 13-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Rewrite function definitions in the UFS/FFS code base with identifier lists.

The days of K&R-style function definitions in UFS and elsewhere in the
tree are numbered, as this syntax is removed in C2x proposal N2432:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf

Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.
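
For readers unfamiliar with the two styles, the kind of rewrite involved
looks roughly like this. blk_round() is a hypothetical function invented
for the example, not one taken from the UFS/FFS sources.

```c
/*
 * Old identifier-list (K&R) definition, shown as a comment because
 * compilers following the newer standard reject it:
 *
 *	int
 *	blk_round(size, bsize)
 *		int size;
 *		int bsize;
 *	{
 *		return ((size + bsize - 1) / bsize * bsize);
 *	}
 */

/* Equivalent prototype-style definition: same body, typed parameter list. */
int
blk_round(int size, int bsize)
{
	return ((size + bsize - 1) / bsize * bsize);
}
```

Only the parameter declarations move; the function body is untouched, which
is why such a large diff is expected to cause no functional change.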

Requested by: Warner Losh
MFC after: 2 weeks


# ecbbb0c8 19-Apr-2022 Stefan Eßer <se@FreeBSD.org>

ffs: plug a set-but-not-used var


# d4b3b0c2 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

ufs: Fix a typo in a source code comment

- s/explicitely/explicitly/

MFC after: 3 days


# 8d8589b3 17-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

ufs: be more persistent with finishing some operations

when the vnode is doomed after relock. The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it
still belongs to UFS, which is determined by non-NULL v_data. Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.

Add macro IS_UFS() which encapsulates the v_data != NULL check, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.
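
A user-space sketch of the idea, assuming the macro is nothing more than
the v_data test named above. struct mock_vnode, its doomed field and
must_finish() are hypothetical stand-ins; the real vnode and
VN_IS_DOOMED() are more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock vnode with only the fields this sketch needs. */
struct mock_vnode {
	void	*v_data;	/* filesystem-private data; NULL once detached */
	bool	 doomed;	/* stands in for what VN_IS_DOOMED() reports */
};

/* The vnode still belongs to UFS while its private data is attached. */
#define	IS_UFS(vp)	((vp)->v_data != NULL)

/* Should an operation such as syncing the inode block still be finished? */
static bool
must_finish(const struct mock_vnode *vp)
{
	return (IS_UFS(vp));	/* deliberately ignores vp->doomed */
}
```

A vnode that is doomed but still has its private data attached is
therefore still processed.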

Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# bebff615 19-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

ffs_softdep: Fix a typo in a source code comment

- s/conditonally/conditionally/

MFC after: 3 days


# 2030ee0e 19-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

ufs: remove write-only variables

Mark variables as __diagused for invariant-only vars

Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577


# b4a58fbf 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove cn_thread

It is always curthread.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32453


# 3b29c8b4 24-Aug-2021 Keith Owens <keith.owens2@dell.com>

ddb: do not assume that ffs is mounted with softdep

Avoid a panic when debugging with "show ffs" in ddb.

Reviewed By: kib, markj, mckusick
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31622


# a91716ef 29-Jul-2021 Kirk McKusick <mckusick@FreeBSD.org>

Clean up orphaned indirdep dependency structures after disk failure.

During forcible unmount after a disk failure there is a bug that
causes one or more indirdep dependency structures to fail to be
deallocated. Until we manage to track down why they fail to get
cleaned up, this code tracks them down and eliminates them so that
the unmount can succeed.

Reported by: Peter Holm
Help from: kib
Reviewed by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix


# 412b5e40 29-Jul-2021 Kirk McKusick <mckusick@FreeBSD.org>

Diagnostic improvement to soft dependency structure management.

The soft updates diagnostic code keeps a list for each type of soft
update dependency. When a new block is allocated for a file it is
initially tracked by a "newblk" dependency. The "newblk" dependency
eventually becomes either an "allocdirect" dependency or an "allocindir"
dependency. The diagnostic code failed to move the "newblk" from the list
of "newblk"s to its new type list.

No functional change intended.

Reviewed by: Chuck Silvers (as part of a larger change)
Tested by: Peter Holm (as part of a larger change)
Sponsored by: Netflix


# 58109a87 23-Jul-2021 John Baldwin <jhb@FreeBSD.org>

Use an ANSI C function declaration for journal_check_space.

GCC6 fails to compile this due to a -Wstrict-prototypes error.

Sponsored by: Chelsio Communications


# 50acaaef 15-Jun-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_softdep: force sync if journal is low in journal_check_space

This effectively causes syncing of the mount point from softdep_prealloc(),
softdep_prerename(), and softdep_prelink(). Typically it avoids the need
for journal suspension at this point, at all.

Suggested and reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041


# 2126f103 15-Jun-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_softdep.c: add journal_check_space() helper

Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041


# 64b494a1 01-May-2021 Konstantin Belousov <kib@FreeBSD.org>

softdep_prelink(): only do sync if other thread changed the vnode metadata since previous prelink

We call into softdep_prerename() and softdep_prelink() when there is
low free space in the journal. These functions sync all vnodes participating
in the VOP, in the hope that this would reduce journal utilization. But
if the vnodes are already synced, syncing again would only waste writes:
the journal is being filled by records that do not come from modifications
of our vnodes.

Remember original seqc numbers for vnodes, and only initiate syncs when
seqc changed.
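
The remember-and-compare idea can be sketched like this. It is only an
illustration under simplified assumptions: the counter type, struct
mock_vnode_seq, struct prelink_state and need_sync() are hypothetical, and
the kernel's seqc interface is not reproduced here.

```c
#include <stdbool.h>

typedef unsigned int seqc_sketch_t;	/* stand-in for the kernel's seqc type */

struct mock_vnode_seq {
	seqc_sketch_t v_seqc;		/* bumped whenever metadata changes */
};

struct prelink_state {
	seqc_sketch_t saved;		/* value seen at the previous prelink */
	bool valid;			/* false until a value was recorded */
};

/* Return true if a sync is worthwhile, and remember the current counter. */
static bool
need_sync(const struct mock_vnode_seq *vp, struct prelink_state *st)
{
	bool changed = !st->valid || st->saved != vp->v_seqc;

	st->saved = vp->v_seqc;
	st->valid = true;
	return (changed);
}
```

The first call always syncs; later calls sync only if the counter moved in
between.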

Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041


# d0929a99 29-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: reduce number of dvp relocks in softdep_prelink()

If vp == NULL, we unlocked and then immediately relocked dvp there.

Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041


# e3d67595 13-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

b_vflags update requires bufobj lock

The trunc_dependencies() issue was reported by Alexander Lochmann
<alexander.lochmann@tu-dortmund.de>, who found the problem by performing
lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 0b3948e7 06-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

softdep_unmount: assert that no dangling dependencies are left

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 7a8d4b4d 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

FFS: assign fully initialized struct mount_softdeps to um_softdep

Other threads observing the non-NULL um_softdep can assume that it is
safe to use it. This is important for ro->rw remounts where change from
read-only to read-write status cannot be made atomic.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# d7e5e374 28-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

softdep_unmount: handle spurious wakeups

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# fabbc3d8 28-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

softdep_flush(): do not access ump after we acked FLUSH_EXIT and unlocked SU lock

otherwise we might follow a pointer in the freed memory.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 7c7a6681 28-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: clear MNT_SOFTDEP earlier when remounting rw to ro

Suppose that we remount rw->ro and in parallel some reader tries to
instantiate a vnode, e.g. during lookup. Suppose that softdep_unmount()
already started, but we did not clear the MNT_SOFTDEP flag yet.
Then ffs_vgetf() calls into softdep_load_inodeblock() which accesses
destroyed hashes and freed memory.

Set/clear fs_ronly simultaneously (WRT files flush) with MNT_SOFTDEP.
It might be reasonable to move the change of fs_ronly to under MNT_ILOCK,
but no readers take it.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 7f682bdc 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Rework MOUNTED/DOING SOFTDEP/SUJ macros

Now MNT_SOFTDEP indicates that SU are active in any variant +-J, and
SU+J is indicated by MNT_SOFTDEP | MNT_SUJ combination. The reason is
that unmount will be able to easily hide SU from other operations by
clearing MNT_SOFTDEP while keeping the record of the active journal.
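
The flag combination described above can be sketched as two predicates.
The bit values and the SK_ names are illustrative only; the real flag
definitions and the reworked macros live in the kernel headers and are not
reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only, not the kernel's. */
#define	SK_MNT_SOFTDEP	0x1u
#define	SK_MNT_SUJ	0x2u

/* Soft updates are active in either variant (with or without journal). */
static bool
su_active(uint64_t flags)
{
	return ((flags & SK_MNT_SOFTDEP) != 0);
}

/* SU+J is active only when both bits are set. */
static bool
suj_active(uint64_t flags)
{
	const uint64_t both = SK_MNT_SOFTDEP | SK_MNT_SUJ;

	return ((flags & both) == both);
}
```

Clearing only the SOFTDEP bit makes both predicates false while the SUJ bit
stays set as a record that a journal exists.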

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 81cdb19e 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs softdep: clear ump->um_softdep on softdep_unmount()

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# fd97fa64 03-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Add FFSV_FORCEINODEDEP flag for ffs_vgetf()

It will be used to allow SU flush code to sync the volume while external
consumers see that SU is already disabled on the filesystem. Use it where
ffs_vgetf() called by SU code to process dependencies.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 25aac48d 04-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

simplify journal_mount: move the out label after success block

This removes the need to check for error == 0.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 28703d27 31-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs softdep: Force processing of VI_OWEINACT vnodes when there is inode shortage

Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc()
is unable to find a free inode.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 2011b44f 03-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

softdep_request_cleanup: wait for softdep_request_clean_flush() to pass

if we noted a parallel request is active and declined to overflow the
system with parallel redundant sync of the vnodes. But we need to wait
for the flush to finish to see if there are any freed resources.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 013168db 30-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ufs_inactive(): stop hiding ERELOOKUP from ffs_truncate(), return it.

VFS should retry inactivation when possible, then. This should provide
timely removal of unlinked unreferenced inodes.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# ede40b06 23-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs softdep: remove will_direnter argument of softdep_prelink()

Originally this was done in 8a1509e442bc9a075 to forcibly cover cases
where a hole in the directory could be created by extending into
indirect block, since dependency of writing out indirect block is not
tracked. This results in excessive amount of fsyncing the directories,
where all creation of new entry forced fsync before it. This is not needed,
it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides
the required hook to only perform required syncing.

The series of changes culminating in this commit puts the performance of
metadata-intensive loads back to that before 8a1509e442bc9a075.

Analyzed by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 93dba42c 11-Dec-2020 Ryan Libby <rlibby@FreeBSD.org>

ffs: quiet -Wstrict-prototypes

Reviewed by: kib, markj, mckusick
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D27558


# 92bcefd1 26-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

clear_inodedeps: handle ERELOOKUP from ffs_syncvnode().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation


# 07ef907f 25-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

ffs_softdep.c: get_parent_vp(): Fix bp lock leak when inum inode was already freed.

Reported by: markj, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 8a1509e4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Handle LoR in flush_pagedep_deps().

When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock. Typically this other vnode is the parent of our vnode, but due
to renames occurring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed. This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps(). The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode was
reclaimed. All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removes directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP. For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied. Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved paths. [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged. A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive(). In this case, dropping the
vnode lock is not safe. Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT. ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode if truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 61846fc4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a framework that tracks exclusive vnode lock generation count for UFS.

This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored. Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents. If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter. Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.
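
The checking scheme can be sketched in a few lines. This is a simplified
model, not the kernel framework: struct mock_dir and the dir_* helpers are
hypothetical, and a plain counter stands in for the exclusive vnode lock.

```c
#include <stdbool.h>
#include <stdint.h>

struct mock_dir {
	uint64_t lock_gen;	/* bumped on every exclusive lock acquisition */
	int64_t	 i_offset;	/* lookup result consumed by a later VOP */
	uint64_t saved_gen;	/* generation at which i_offset was stored */
};

/* Taking the exclusive lock starts a new generation. */
static void
dir_lock_excl(struct mock_dir *d)
{
	d->lock_gen++;
}

/* Lookup stores its result together with the current generation. */
static void
dir_store_lookup(struct mock_dir *d, int64_t off)
{
	d->i_offset = off;
	d->saved_gen = d->lock_gen;
}

/* True if the cached lookup data was stored under the current lock hold. */
static bool
dir_lookup_valid(const struct mock_dir *d)
{
	return (d->saved_gen == d->lock_gen);
}
```

If the lock is dropped and re-acquired between the lookup and the VOP, the
generations differ and a check of this kind would fire.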

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock. It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# f4499487 11-Nov-2020 Mark Johnston <markj@FreeBSD.org>

ffs: Clamp BIO_SPEEDUP length

On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by
softdep_request_cleanup() may be negative when assigned to bp->b_bcount,
which has type "long".

Clamp the size to LONG_MAX. Also convert the unused g_io_speedup() to
use an off_t for the magnitude of the shortage for consistency with
softdep_send_speedup().
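
The clamping step can be illustrated in portable C. To make the 32-bit case
visible on any host, the sketch uses int32_t in place of a 32-bit "long"
and INT32_MAX in place of LONG_MAX; clamp_to_long32() is a hypothetical
helper, and the lower bound is clamped as well purely to keep the
conversion well defined.

```c
#include <stdint.h>

/* Clamp a 64-bit shortage to what fits in a 32-bit signed "long". */
static int32_t
clamp_to_long32(int64_t shortage)
{
	if (shortage > INT32_MAX)
		return (INT32_MAX);
	if (shortage < INT32_MIN)
		return (INT32_MIN);
	return ((int32_t)shortage);
}
```

Without the clamp, a shortage of 5000000000 bytes would not fit in 32 bits
and could end up negative after conversion.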

Reviewed by: chs, kib
Reported by: pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27081


# d90f2c36 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: clean up empty lines in .c and .h files


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@
expression arg1, arg2, arg3;
@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# 52488b51 04-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Further evaluation of the POSIX spec for fdatasync() shows that it
requires that new data on growing files be accessible. Thus, the
fsyncdata() system call must update the on-disk inode when the
size of the file has changed.

This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.
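
The decision described above reduces to a single flag test. The bit values
and the SK_ names below are illustrative assumptions, not the values used
in the UFS headers, and datasync_must_write_inode() is a hypothetical
helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only. */
#define	SK_IN_IBLKDATA	0x1u	/* inode block pointers changed */
#define	SK_IN_SIZEMOD	0x2u	/* file size changed */

/* Must the inode be written synchronously for a data-only sync? */
static bool
datasync_must_write_inode(uint32_t i_flag)
{
	return ((i_flag & (SK_IN_IBLKDATA | SK_IN_SIZEMOD)) != 0);
}
```

Keeping the two conditions as separate bits costs nothing in this test and
leaves room for callers that care about only one of them.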

Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072


# d79ff54b 25-May-2020 Chuck Silvers <chs@FreeBSD.org>

This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.

- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
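
The two cases can be sketched as a small decision function. This models the
description only: struct mock_dev, its reject_all_io flag and
handle_write_error() are hypothetical, and the real change is made in the
filesystem and geom_vfs layers rather than in one routine.

```c
#include <errno.h>
#include <stdbool.h>

struct mock_dev {
	bool reject_all_io;	/* when set, later requests fail with ENXIO */
};

/*
 * Handle the result of a write.  Returns the error the filesystem carries
 * on with, which is always 0 in this sketch: the error is cleared and the
 * forced unmount takes over.
 */
static int
handle_write_error(struct mock_dev *dev, int error)
{
	if (error == 0)
		return (0);
	if (error != ENXIO) {
		/* Disk may still accept I/O: arrange for it to be rejected. */
		dev->reject_all_io = true;
	}
	/* For ENXIO the lower layers already guarantee no further I/O. */
	return (0);
}
```

Either way the on-disk state can no longer change behind the filesystem's
back, which is what makes clearing the error safe.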

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies requires a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
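
A minimal sketch of the zero-filled read and the link count restore,
assuming simplified stand-in structures. The names sketch_dinode,
sketch_inodedep and saved_nlink are hypothetical and only illustrate the
idea described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the on-disk inode and inodedep. */
struct sketch_dinode {
	uint16_t di_nlink;
};

struct sketch_inodedep {
	uint16_t saved_nlink;	/* i_nlink value that was last written */
};

/*
 * Simulate a read that cannot reach the disk: fabricate a zero-filled
 * inode, then restore the link count remembered in the dependency so
 * that i_effnlink is not recomputed from a bogus zero.
 */
void
sketch_fake_inode_read(struct sketch_dinode *dp,
    const struct sketch_inodedep *dep)
{
	assert(dp != NULL && dep != NULL);
	memset(dp, 0, sizeof(*dp));
	dp->di_nlink = dep->saved_nlink;
}
```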

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthored by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 98b68446 18-Feb-2020 Kirk McKusick <mckusick@FreeBSD.org>

Additional KASSERTs to ensure the consistency of the soft updates
indirdep structure. No functional change.

Tested by: Peter Holm (as part of a larger patch)
Sponsored by: Netflix


# 13532153 16-Feb-2020 Scott Long <scottl@FreeBSD.org>

Add rudimentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command. Add complementary support to the CAM periphs that
support it. This is a redo of r357710.


# 85eb41f7 10-Feb-2020 Scott Long <scottl@FreeBSD.org>

Revert r357710 and r357711 until they can be debugged


# 7d99bda7 09-Feb-2020 Scott Long <scottl@FreeBSD.org>

Add rudimentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command. Add complementary support to the CAM periphs that
support it.


# 62612737 03-Feb-2020 Chuck Silvers <chs@FreeBSD.org>

With INVARIANTS, track all softdep dependency structures centrally
so that we can find them in dumps.

Approved by: mckusick (mentor)
Sponsored by: Netflix


# 38b37b93 16-Jan-2020 Warner Losh <imp@FreeBSD.org>

We only want to send the speedup to the lower layers when there's a shortage.

Only send a speedup when there's a shortage. While this is a little racy, lost
races aren't a big deal for this function. If there's a shortage just popping up
after we check these values, then we'll catch it next time. If there's a
shortage that's just clearing up, we may do some work at the lower layers a
little sooner than we otherwise would have. Since shortages are relatively rare
events, both races are acceptable.
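
The unlocked check can be sketched as follows. The structure and
function names are hypothetical; the real code consults the soft updates
shortage state rather than these toy counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters standing in for the real shortage state. */
struct sketch_resources {
	int in_use;
	int limit;
	int speedups_sent;
};

/*
 * Send a speedup only when a shortage is observed. The check is made
 * without a lock: a stale answer only moves the work to the next call
 * or does it slightly early.
 */
bool
sketch_maybe_speedup(struct sketch_resources *r)
{
	assert(r != NULL);
	if (r->in_use < r->limit)
		return (false);
	r->speedups_sent++;
	return (true);
}
```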

Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23182


# 3cf5dd84 16-Jan-2020 Warner Losh <imp@FreeBSD.org>

Use buf to send speedup

It turns out there's a problem with using g_io to send the speedup. It leads to
a race when there's a resource shortage when a disk fails.

Instead, send BIO_SPEEDUP via struct buf. This is pretty straightforward,
except we need to transfer the bio_flags from b_ioflags for BIO_SPEEDUP commands
in g_vfs_strategy.

Reviewed by: kirk, chs
Differential Revision: https://reviews.freebsd.org/D23117


# 815b7486 13-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a long standing bug in journaled soft-updates. The dirrem structure
needs to handle file removal, directory removal, file move, directory move,
etc. The code in handle_workitem_remove() needs to propagate any completed
journal entries to the write that will render the change stable. In the
case of a moved directory this means the new parent. However, for an
overwrite that frees a directory (DIRCHG) we must move the jsegdep to the
removed inode to be released when it is stable in the cg bitmap or the
unlinked inode list. This case was previously unhandled and caused a
panic.

Reported by: mckusick, pho
Reviewed by: mckusick
Tested by: pho


# ac4ec141 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add a setter for inode i_flag field

This will be used later to add vnodes to the lazy list.

Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 4085590d 27-Dec-2019 Konstantin Belousov <kib@FreeBSD.org>

ufs: do not leave non-reclaimed vnodes with zero i_mode around.

After a recent change, vput() relocks even the exclusively locked
vnode before inactivating it. Before that, UFS could safely
instantiate a vnode for a cleared inode, then the last vput() after
ffs_vgetf() noted that ip->i_mode == 0 and recycled it. Now, it is
possible for other threads to note the half-constructed vnode, e.g. to
insert it into the hash, which lets other threads use it even though
its mode is zero, before inactivation and reclaim.

Handle the found cases in SU code, by explicitly doing reclaim.
Assert that other places get fully constructed inode from ffs_vgetf(),
which cannot be cleared before dependencies are resolved.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 56e4d458 18-Dec-2019 Warner Losh <imp@FreeBSD.org>

Drop a sleepable lock when we plan on sleeping

g_io_speedup waits for the completion of the speedup request before proceeding
using biowait(), but check_clear_deps is called with the softdeps lock held
(which is non-sleepable). It's safe to drop this lock around the call to
speedup, so do that.

Submitted by: Peter Holm
Reviewed by: kib@


# 22dd705f 16-Dec-2019 Warner Losh <imp@FreeBSD.org>

Add BIO_SPEEDUP signalling to UFS

When we have a resource shortage in UFS, send down a BIO_SPEEDUP to
give the CAM I/O scheduler a heads up that we have a resource shortage
and that it should bias its decisions knowing that.

Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# d00066a5 03-Dec-2019 Kirk McKusick <mckusick@FreeBSD.org>

Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_BMAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 486b9a61 19-Nov-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add some KASSERTs. Reacquire a mutex after a kernel printf rather
than holding it during the printf. White space cleanup.

Sponsored by: Netflix


# 7792f701 24-Oct-2019 Kirk McKusick <mckusick@FreeBSD.org>

Soft updates needs to keep an on-disk linked list of inodes that
have been unlinked, but are still referenced by open file descriptors.
These inodes cannot be freed until the final file descriptor reference
has been closed. If the system crashes while they are still being
referenced, these inodes and their referenced blocks need to be
freed by fsck. By having them on a linked list with the head pointer
in the superblock, fsck can quickly find and process them rather
than having to check every inode in the filesystem to see if it is
unreferenced.

When updating the head pointer of this list of unlinked inodes in
the superblock, the superblock check-hash was not getting updated.
If the system crashed with the incorrect superblock check-hash, the
superblock would appear to be corrupted. This patch ensures that
the superblock check-hash is updated when updating the head pointer
of the unlinked inodes list.
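
The failure mode and the fix can be sketched with a toy superblock. All
names here are hypothetical, and sketch_sb_hash() is a stand-in, not the
real superblock check-hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature superblock for this sketch. */
struct sketch_sb {
	uint32_t unlinked_head;	/* first unlinked inode number, 0 if none */
	uint32_t ckhash;	/* check-hash covering the field above */
};

/* Toy hash standing in for the real check-hash computation. */
uint32_t
sketch_sb_hash(const struct sketch_sb *sb)
{
	assert(sb != NULL);
	return (sb->unlinked_head * 2654435761u + 1u);
}

/* Update the list head and recompute the hash in the same step. */
void
sketch_set_unlinked_head(struct sketch_sb *sb, uint32_t ino)
{
	sb->unlinked_head = ino;
	sb->ckhash = sketch_sb_hash(sb);	/* the step that was missing */
}

/* A superblock whose stored hash does not match looks corrupted. */
int
sketch_sb_valid(const struct sketch_sb *sb)
{
	return (sb->ckhash == sketch_sb_hash(sb));
}
```

Changing unlinked_head without going through the setter leaves a stale
hash, which is what a crash at the wrong moment would expose.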

There is no need to MFC as superblock check hashes first appeared in
13.0.

Tested by: Peter Holm
Sponsored by: Netflix


# c456a0a1 18-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Abbreviate softdep lock names.

The softdep lock names were unusually long and tended to stick out in
lock profiling reports. Abbreviate them and make them consistent with
our conventional style for lock names.

Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22042


# fdd888de 04-Oct-2019 Eric van Gyzen <vangyzen@FreeBSD.org>

Add CTLFLAG_STATS to several debug.softdep sysctl OIDs

Refer to r353111.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon


# 4cace859 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert struct mount counters to per-cpu

There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s
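
A per-CPU counter of this general kind can be sketched in plain C. This
is a single-threaded illustration with hypothetical names; it is not
the kernel's implementation and omits the cache-line padding and CPU
pinning real code would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NCPU 4

/* One slot per CPU, so updates never contend on a shared word. */
struct sketch_pcpu_counter {
	int64_t slot[SKETCH_NCPU];
};

/* Cheap path: touch only the caller's own slot. */
void
sketch_counter_add(struct sketch_pcpu_counter *c, int cpu, int64_t n)
{
	assert(c != NULL && cpu >= 0 && cpu < SKETCH_NCPU);
	c->slot[cpu] += n;
}

/* Rare path: the exact value is the sum over all slots. */
int64_t
sketch_counter_fetch(const struct sketch_pcpu_counter *c)
{
	int64_t sum = 0;

	for (int i = 0; i < SKETCH_NCPU; i++)
		sum += c->slot[i];
	return (sum);
}
```

An individual slot may go negative when a reference is taken on one CPU
and dropped on another; only the sum is meaningful.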

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637


# e87f3f72 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: manage mnt_writeopcount with atomics

See r352424.

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21575


# d89ac450 09-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Remove some unneeded vfs_busy() calls in SU code.

When softdep_fsync() is running, a caller must have already started a
write for the mount point. Since unmount or remount to ro suspends the
mount point, it cannot run in parallel with softdep_fsync(), which
makes the vfs_busy() call there unnecessary.

Doing blocking vfs_busy() there effectively causes lock order reversal
between vn_start_write() and setting MNTK_UNMOUNT, because
vfs_busy(mp, 0) sleeps waiting for MNTK_UNMOUNT becoming clear, while
unmount sets the flag and starts the suspension.

Note that all other uses of vfs_busy() in SU code are non-blocking.

Reported by: chs by mckusick
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f3cf6225 06-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

ufs: Remove redundant brelse() after r294954

Same automation.

No functional change.


# e671edac 23-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

De-commission the MNTK_NOINSMNTQ kernel mount flag.

After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 90381b1c 31-Jul-2019 Kirk McKusick <mckusick@FreeBSD.org>

When updating the user or group disk quotas for the return of inodes or
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential.

Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC: 1 week
Sponsored by: Netflix


# fdf34aa3 17-Jul-2019 Kirk McKusick <mckusick@FreeBSD.org>

The error reported in FS-14-UFS-3 can only happen on UFS/FFS
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.

A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.

This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.

Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by: kib
Sponsored by: Netflix


# 6137883f 26-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Remove references to splbio in ffs_softdep.c.

Assert that the per-mountpoint softdep mutex is held in modified
functions that do not already have this assertion. No functional
change intended.

Reviewed by: kib, mckusick (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20741


# e9482844 28-May-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add a missing brelse() in a seldom-used error return.


# af6aeacb 28-May-2019 Kirk McKusick <mckusick@FreeBSD.org>

Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS
as appropriate. No functional change intended.

Suggested-by: markj


# 298184ac 27-May-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add function name and line number debugging information to softupdates
worklist structures to help track their movement between work lists.
No functional change to the operation of soft updates intended.


# 69166928 20-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

This is an additional and hopefully final fix for bug report 230962.
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.

Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which led to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.

PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix


# 42a5a356 11-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add KASSERT to the softdep_disk_write_complete() function in the
soft dependency code to ensure that it will be able to avoid a
dangling dependency.

Sponsored by: Netflix


# baba6af7 28-Jan-2019 Kirk McKusick <mckusick@FreeBSD.org>

This bug was introduced with the change to use softdep_bp_to_mp() in
January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VFIFO as one of the valid cases.

Although fifo's do not allocate blocks in the filesystem, they will
allocate blocks if they use extended attributes (such as ACLs). Thus,
softdep_bp_to_mp() needs to return a non-NULL mount pointer when
presented with a fifo vnode so that the soft updates write complete
will properly process the soft updates structures associated with the
extended attribute blocks. It was the failure to process these soft
updates structures, thus leaving them hanging off the buffer, which
led to the "panic: softdep_deallocate_dependencies: dangling deps"
when trying to clean up the buffer after it was written.

PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix


# 6967c09c 25-Jan-2019 Kirk McKusick <mckusick@FreeBSD.org>

Expand DDB's set of printable soft dependency data structures. The
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.

Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.

Sponsored by: Netflix


# 8f829a5c 11-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# 9fc5d538 13-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# b7befdf5 22-Sep-2018 Konstantin Belousov <kib@FreeBSD.org>

Correct panic messages.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
Approved by: re (rgrimes)
MFC after: 1 week


# 7e038bc2 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Replace the TRIM consolidation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolidation.

Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix


# cc91864c 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.


# 68c49bcc 06-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Put in place the framework for consolidating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolidation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolidation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they are likely to have
been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix


# 8ab50758 16-May-2018 Kirk McKusick <mckusick@FreeBSD.org>

Fix warning found by Coverity.

CID 1009353: Error handling issues (CHECKED_RETURN)


# f8ccf173 04-Apr-2018 Kirk McKusick <mckusick@FreeBSD.org>

Renumber soft-update types starting at 1 instead of 0 to avoid confusion
of zero'ed memory appearing to have a valid soft-update type.
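
The benefit of starting at 1 can be sketched as follows. The enum values
and names are hypothetical and do not reproduce the real soft-update
type numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type numbers: 0 is deliberately left unassigned. */
enum sketch_type {
	SK_PAGEDEP = 1,
	SK_INODEDEP = 2,
	SK_LASTTYPE = SK_INODEDEP
};

struct sketch_typed_item {
	int type;
};

/* Zero-filled memory can never pass for a valid item. */
int
sketch_type_valid(const struct sketch_typed_item *it)
{
	assert(it != NULL);
	return (it->type >= SK_PAGEDEP && it->type <= SK_LASTTYPE);
}
```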

Also correct some comments.

Reviewed by: kib


# d8ba45e2 16-Mar-2018 Ed Maste <emaste@FreeBSD.org>

Revert r313780 (UFS_ prefix)


# 1e2b9afc 16-Mar-2018 Ed Maste <emaste@FreeBSD.org>

Prefix UFS symbols with UFS_ to reduce namespace pollution

Followup to r313780. Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.

Reported by: kib
Reviewed by: kib, mckusick
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9623


# d4e6557b 28-Feb-2018 Conrad Meyer <cem@FreeBSD.org>

ffs: softdep_disk_write_complete: Quiesce spurious Coverity warning

Coverity cannot determine that handle_written_indirdep() does not access
uninitialized 'sbp' when flags argument is zero.

So, simply move the initialization slightly sooner to silence the warning.

No functional change.

Reported by: Coverity
Sponsored by: Dell EMC Isilon


# a94a2945 24-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

ext2fs|ufs: Unsign some values related to allocation.

When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after: 2 weeks


# f9834d10 24-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

Revert r327781, r328093, r328056:
ufs|ext2fs: Revert uses of mallocarray(9).

These aren't really useful: drop them.
Variable unsigning will be brought again later.


# 90b618f3 17-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: use mallocarray(9).

Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.

Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by: mckusick


# 147b0c1a 11-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Softlink inodes can own buffers with dependencies.

At least, softlinks longer than 120 bytes have data fragments.

Submitted by: mckusick
MFC after: 5 days


# c999b435 09-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Generalize the fix from r322757 and apply it to several more places.

The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the dereferenced workitem to be freed in
parallel. In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.

Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.

Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e51e3c7e 09-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

When handling write completion, take SU lock around calls to
handle_written_XXX() in case of processing the buffer with an error.

Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# caa7e52f 26-Dec-2017 Eitan Adler <eadler@FreeBSD.org>

kernel: Fix several typos and minor errors

- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by: imp, benno


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# 47706559 27-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Remove a stale and incorrect comment.

MFC after: 1 week
Sponsored by: Dell EMC Isilon


# 9cf7abcc 27-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Remove workqueue items after updating the workqueue tail pointer.

When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.
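
The ordering issue can be sketched with a hand-rolled list. This is an
assumption-laden illustration: the structures, SKETCH_TRASH and
sketch_remove_after() are hypothetical and do not use the queue(3)
macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_wkitem {
	struct sketch_wkitem *next;
};

struct sketch_wkqueue {
	struct sketch_wkitem *head;
	struct sketch_wkitem *tail;
};

/* Poison value written into the linkage of a removed item. */
#define SKETCH_TRASH ((struct sketch_wkitem *)(uintptr_t)-1)

/* Remove 'it', which follows 'prev' (NULL if 'it' is the head). */
void
sketch_remove_after(struct sketch_wkqueue *q, struct sketch_wkitem *prev,
    struct sketch_wkitem *it)
{
	assert(q != NULL && it != NULL);
	/* Read it->next while it is still valid: fix up the tail first. */
	if (it->next == NULL)
		q->tail = prev;
	/* Only now unlink the item and poison its linkage field. */
	if (prev != NULL)
		prev->next = it->next;
	else
		q->head = it->next;
	it->next = SKETCH_TRASH;
}
```

Reversing the two steps would make the tail check read the poisoned
pointer instead of the real successor.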

No functional change intended.

MFC after: 1 week
Sponsored by: Dell EMC Isilon


# 4c52a999 25-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Make drain_output() use bufobj_wwait().

No functional change intended.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12790


# 800c3e80 26-Sep-2017 John Baldwin <jhb@FreeBSD.org>

Don't defer wakeup()s for completed journal workitems.

Normally wakeup()s are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration. However, this
resulted in the item being free'd before its address was used with
wakeup(). As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock. This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue. Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.

The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need. However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.

In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.

Reported by: pho
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12494


# 4f45713a 18-Sep-2017 John Baldwin <jhb@FreeBSD.org>

Add UFS_LINK_MAX for the UFS-specific limit on link counts.

ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits. This is a nop currently but will matter if LINK_MAX is
increased in the future.

Reviewed by: kib
Sponsored by: Chelsio Communications


# 2f9d88c7 25-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Protect v_rdev dereference with the vnode interlock instead of the
vnode lock.

Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 10 days


# f0d52232 21-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Avoid dereferencing potentially freed workitem in
softdep_count_dependencies().

Buffer's b_dep list is protected by the SU mount lock. Owning the
buffer lock is not enough to guarantee the stability of the list.

Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime. To get to ump, use the pointers chain which does not
involve workitems at all.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# b5f2560d 21-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 698f05ab 03-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Mitigate several problems with softdep_request_cleanup() on a busy
host.

Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup. It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().

There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due
to the locking, they quickly start executing 'in phase' with the speed
of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads'
non-blocking attempts to lock the vnode owned by another thread fail,
and the loops execute without making progress.
Retry logic does not allow the situation to recover. The result is
a livelock.

Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
even if there are still work items on the mp work list. The
assumption is that the items will be cleaned when another thread
either fsyncs its vnode, or unlocks it and allows yet another thread
to make progress.
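
The two changes can be sketched as a guarded loop. This is a
single-threaded illustration with hypothetical names; SKETCH_RC_ACTIVE
merely stands in for FLUSH_RC_ACTIVE, and the caller is assumed to hold
whatever lock protects the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_RC_ACTIVE 0x1	/* stands in for FLUSH_RC_ACTIVE */

struct sketch_rcmount {
	int flags;
	int pending;	/* work items still on the list */
};

/* Example scan that locks every vnode and retires one item per pass. */
int
sketch_scan_ok(struct sketch_rcmount *mp)
{
	if (mp->pending > 0)
		mp->pending--;
	return (0);		/* number of failed lock attempts */
}

/* Example scan where a vnode lock could not be taken. */
int
sketch_scan_busy(struct sketch_rcmount *mp)
{
	(void)mp;
	return (1);
}

/* Returns true if this caller ran the scan loop. */
bool
sketch_request_cleanup(struct sketch_rcmount *mp,
    int (*scan)(struct sketch_rcmount *))
{
	int failed;

	assert(mp != NULL && scan != NULL);
	if (mp->flags & SKETCH_RC_ACTIVE)
		return (false);	/* another thread is already scanning */
	mp->flags |= SKETCH_RC_ACTIVE;
	do {
		failed = scan(mp);
		/* Stop once any lock attempt failed, even with work left. */
	} while (failed == 0 && mp->pending > 0);
	mp->flags &= ~SKETCH_RC_ACTIVE;
	return (true);
}
```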

It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how we can reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry. But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.

Reported and tested by: pho
Style and comments by: mckusick
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# aca4bb91 25-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not leak mount references for dying threads.

Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 1dc349ab 15-Feb-2017 Ed Maste <emaste@FreeBSD.org>

prefix UFS symbols with UFS_ to reduce namespace pollution

Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by: kib, mckusick
Obtained from: NetBSD
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9536


# 1c324569 06-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Use type-independent formats for printing nlink_t and ino_t.
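
As an illustration only (not the committed diff), the usual C idiom for
a type-independent format is to cast the value to uintmax_t and print it
with %ju, so the format string stays correct if ino_t or nlink_t changes
width. The helper name format_inode_info() is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format an inode number and a link count without
 * assuming the width of ino_t or nlink_t.  Both values are widened to
 * uintmax_t, which %ju is defined to print.
 */
static int
format_inode_info(char *buf, size_t len, ino_t ino, nlink_t nlink)
{
	return (snprintf(buf, len, "ino %ju nlink %ju",
	    (uintmax_t)ino, (uintmax_t)nlink));
}
```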

Extracted from: ino64 work by gleb, mckusick
Discussed with: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e1db6897 17-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Reduce size of ufs inode.

Remove redundant i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate for the added dereferences, the most
frequent i_ump access, which differentiates between UFS1 and UFS2
dinode layout, is removed by adding the new i_flag IN_UFS2. Overall,
this actually reduces the number of memory dereferences.

On a 64-bit machine, the original struct inode size is 176 bytes,
reduced to 152 bytes with the change.

Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 7b05b8a2 07-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Do not leak transient ENOLCK error from flush_newblk_dep() loop.

The buffer lock is retried on a failed LK_SLEEPFAIL attempt, and the
error from the failed attempt is irrelevant. But since there is a path
after the retry which does not clear the error, it is possible to
return a spurious error from the function.

The issue resulted in a spurious failure of softdep_sync_buf(),
causing further spurious failure of ffs_sync().
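
The failure mode can be sketched with a small simulation. This is not
the code of flush_newblk_dep(); struct sim, flush_sketch() and the
clear_on_retry switch are hypothetical names used only to show how an
error from a failed lock attempt can survive to the return statement
when a later exit path does not reset it.

```c
#include <errno.h>

/* Simulated state: all fields are assumptions made for this sketch. */
struct sim {
	int lock_failures;	/* lock attempts that fail with ENOLCK */
	int work_items;		/* dependencies still to be flushed */
};

static int
flush_sketch(struct sim *s, int clear_on_retry)
{
	int error = 0;

	for (;;) {
		if (s->work_items == 0)
			break;		/* exit path that never touches error */
		if (s->lock_failures > 0) {
			/*
			 * The lock attempt failed.  While we slept, another
			 * thread is assumed to have finished the work.
			 */
			s->lock_failures--;
			s->work_items = 0;
			error = ENOLCK;
			if (clear_on_retry)
				error = 0;	/* transient: forget it */
			continue;
		}
		s->work_items--;	/* lock held: flush one item */
		error = 0;
	}
	return (error);
}
```

With clear_on_retry set to 0 the sketch returns ENOLCK even though
nothing is left to flush; with it set to 1 the transient error is
dropped before the retry.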

In collaboration with: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 60f1c000 07-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

In softdep_prealloc(), return early not only for snapshots, but for
the quota files as well.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 988fd417 16-Aug-2016 Kirk McKusick <mckusick@FreeBSD.org>

Bug 211013 reports that a write error to a UFS filesystem running
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.

These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes. The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return. This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.

But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.

The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.
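
A minimal sketch of that control flow, under stated assumptions: the
flag name WRITESUCCEEDED comes from the text above, but its value, the
struct dep_sim type and both function names are hypothetical, and a
return of -1 stands in for the "already started" panic.

```c
#include <stdbool.h>

#define	WRITESUCCEEDED	0x0001	/* name from the commit; value assumed */

struct dep_sim {
	bool rolled_back;	/* buffer currently holds rolled-back data */
	int completed;		/* dependencies processed after good writes */
};

/* Before the write: apply the roll-back, refusing to do it twice. */
static int
io_initiation_sketch(struct dep_sim *d)
{
	if (d->rolled_back)
		return (-1);	/* the kernel would panic here */
	d->rolled_back = true;
	return (0);
}

/* After the write: always restore the buffer, then process on success. */
static void
write_complete_sketch(struct dep_sim *d, int flags)
{
	d->rolled_back = false;
	if ((flags & WRITESUCCEEDED) == 0)
		return;		/* failed write: restore only */
	d->completed++;
}
```

Because the buffer is restored even when the write fails, a retried
write can pass through the initiation step again without tripping the
check.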

The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer panicked and that, following the
successful retry of the write, the filesystem could be unmounted
and checked cleanly.

PR: 211013
Reviewed by: kib


# 08e94183 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

UFS: spelling fixes on comments.

No functional change.


# d9c9c81c 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: use our roundup2/rounddown2() macros when param.h is available.

rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
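
For reference, the arithmetic behind these macros can be sketched with
local stand-ins. The functions below are illustrative copies written for
this note, not the param.h definitions, and like the macros they are
only valid when the alignment is a power of two.

```c
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t
sketch_rounddown2(uint64_t x, uint64_t align)
{
	return (x & ~(align - 1));
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t
sketch_roundup2(uint64_t x, uint64_t align)
{
	return ((x + (align - 1)) & ~(align - 1));
}
```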


# abafa4db 10-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: replace 0 with NULL for pointers.

While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by: mckusick
MFC after: 2 weeks


# ae34b6ff 06-Apr-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080


# 6336aefc 23-Mar-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 44de1ba1 21-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC(). It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 20e8b32d 06-Oct-2015 Gleb Smirnoff <glebius@FreeBSD.org>

In softdep_setup_freeblocks():
- Move the bread() to the beginning of function.
- Return if it fails, otherwise we will panic.

Submitted by: mckusick
Sponsored by: Netflix


# 7a82f35c 04-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not consume extra reference. This is a bug in r287479.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c65f7598 05-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

Declare the writes around the call to VFS_SYNC() in
softdep_ast_cleanup_proc().

Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 76cae067 01-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

By doing file extension fast, it is possible to create excess supply
of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.

Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and
D_DIRREM, by scheduling ast to flush dependencies, after the thread,
which created new dep, left the VFS/FFS innards. For D_NEWBLK, the
only way to get rid of them is to do full sync, since items are
attached to data blocks of arbitrary vnodes. The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.

For 32bit arches, reduce the total amount of allowed dependencies by
two. Increasing the limit for 64-bit platforms with direct maps
could be considered.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# aef373ce 31-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove unused variable.

When deallocate_dependencies() is performed,
softdep_journal_freeblocks() already called cancel_allocdirect() which
should have eliminated direct dependencies for all truncated full
blocks. The indirect dependencies are allowed above, since second-
and third-level dependencies are only dealt with by the code which
frees indirect block, which happens after the inode write.

Discussed with: mckusick, jeff
Reviewed by: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 69baeadc 29-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros. It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review: https://reviews.freebsd.org/D2665
Reviewed by: rodrigc
Looked at by: bjk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 0bc4fe10 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

After r283600, NODELAY flag to inodedep_lookup() function is unused.
Eliminate it, and simplify code by removing the local dflags variable
always initialized to DEPALLOC.

Noted by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 1bc93bb7 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks. Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker's
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode. A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.
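
The split between "notice the overflow" and "act on it" can be sketched
as below. This is a user-space simulation under assumed names
(thread_sim, queue_sim, add_workitem_sketch(), return_to_user_sketch());
it does not reproduce the kernel's ast machinery.

```c
#include <stdbool.h>

struct thread_sim {
	bool cleanup_pending;	/* stand-in for the scheduled ast */
	int locks_held;		/* e.g. vnode locks owned by the thread */
};

struct queue_sim {
	int items;		/* queued workitems */
	int limit;		/* allowed backlog */
};

/* Deep in the call stack: check the size and schedule, never wait. */
static void
add_workitem_sketch(struct thread_sim *td, struct queue_sim *q)
{
	q->items++;
	if (q->items > q->limit)
		td->cleanup_pending = true;
}

/*
 * Safe point on the way back to user mode: no locks are held, so
 * draining the backlog cannot deadlock against this thread.  Returns
 * the number of items flushed.
 */
static int
return_to_user_sketch(struct thread_sim *td, struct queue_sim *q)
{
	int flushed = 0;

	if (td->cleanup_pending && td->locks_held == 0) {
		flushed = q->items;
		q->items = 0;
		td->cleanup_pending = false;
	}
	return (flushed);
}
```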

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads. The NFS server loop
explicitly checks for the request, and informs schedule_cleanup()
that it is capable of handling the requests via the process flag
P2_AST_SU. This is needed because nfsd may be the sole cause of the SU
workqueue overflow. But, to not cause nfsd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by: jhb, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 99289311 27-Mar-2015 Konstantin Belousov <kib@FreeBSD.org>

Fix build (with gcc).

Reported by: bz, ian
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4af9f77e 27-Mar-2015 Konstantin Belousov <kib@FreeBSD.org>

Fix the hang after the immediate reboot when the following command
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount. There are two issues:

1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0. FFS is suspended during
unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually. It is not
linked until the superblock is written. But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed. It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock. Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and subtract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is a third problem, occurring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons. Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.

PR: 195458
In collaboration with: gjb, pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# f6f76e17 05-Feb-2015 Konstantin Belousov <kib@FreeBSD.org>

Partially revert r277922: avoid sleeping and do the flush if we are awakened,
instead of waiting for the FLUSH_* flags. Also, when requesting
flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was
already set.

Reported and tested by: dim,
"Lundberg, Johannes" <johannes@brilliantservice.co.jp>
Bisected by: dim
MFC after: 2 weeks


# 1c9b5856 30-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

When mounting SU-enabled mount point, wait until the softdep_flush()
thread started and incremented the stat_flush_threads [1].

Unconditionally wakeup softdep_flush threads when needed, do not try
to check wchan, which is racy and breaks abstraction.

Reported by and discussed with: glebius, neel
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# ca109b01 02-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

When non-forced unmount or remount rw->ro is performed, writes on UFS
are not suspended. In particular, on the SU-enabled volumes, there is
no reason why, between the call to softdep_flushfiles() and
softdep_waitidle(), SU work items cannot be queued.

Correct the condition to trigger the panic by only checking when a
forced operation is done. Convert the direct panic() call into a
KASSERT(); there are no invalid on-disk data structures directly
involved, so follow the usual debugging vs. non-debugging approach.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# a6b5e6e3 12-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Revision r269457 removed the Giant around mount and unmount code, but
r269533, which was tested before r269457 was committed, implicitly
relied on the Giant to protect the manipulations of the softdepmounts
list. Use softdep global lock consistently to guarantee the list
structure now.

Insert the new struct mount_softdeps into the softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory. Similarly, remove struct mount_softdeps for
the unmounted filesystem from the tailq before destroying structure
rwlock.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3da0b29d 07-Aug-2014 Kirk McKusick <mckusick@FreeBSD.org>

The SUJ journal is only prepared to handle full-size block numbers, so we
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags are the number of fragments to free and oldfrags are the
number of fragments to keep. To block align it, we have to change it to
have a valid full-size blkno, so it becomes:
FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6
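
The adjustment in the example can be written out as a short sketch. The
record layout, the function name and the frags_per_block parameter are
assumptions for illustration (the example above implies 8 fragments per
block); the real journal record has more fields.

```c
#include <stdint.h>

/* Simplified stand-in for a FREEBLK journal record. */
struct freeblk_sketch {
	int64_t blkno;		/* fragment address */
	int frags;		/* fragments to free */
	int oldfrags;		/* fragments to keep */
};

/*
 * Move blkno back to the start of its full-size block and account for
 * the skipped fragments as ones that are kept.
 */
static void
freeblk_block_align(struct freeblk_sketch *r, int frags_per_block)
{
	int off;

	off = (int)(r->blkno % frags_per_block);
	r->blkno -= off;
	r->oldfrags += off;
}
```

Applied to blkno=14, frags=2, oldfrags=0 with 8 fragments per block,
the offset is 6, giving blkno=8, frags=2, oldfrags=6 as in the text.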

Submitted by: Mikihito Takehara
Tested by: Mikihito Takehara
Reviewed by: Jeff Roberson
MFC after: 1 week


# 5f9500c3 04-Aug-2014 Kirk McKusick <mckusick@FreeBSD.org>

Add support for multi-threading of soft updates.

Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix


# 7d155880 06-May-2014 Scott Long <scottl@FreeBSD.org>

Due to reasons unknown at this time, the system can be forced to write
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.

Second, the data buffer used for writing journal entries was never
being scrubbed of old data. Fix this.

Submitted by: Takehara Mikihito
Obtained from: Netflix, Inc.
MFC after: 3 days


# 4896af9f 01-Mar-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after: 3 days
Reviewed by: mckusick


# 2e436d88 01-Dec-2013 Kirk McKusick <mckusick@FreeBSD.org>

We needlessly panic when trying to flush MKDIR_PARENT dependencies.
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.

Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by: John-Mark Gurney <jmg@funkthat.com>
PR: 183424
MFC after: 2 weeks


# b6ffc3b5 20-Nov-2013 John-Mark Gurney <jmg@FreeBSD.org>

fix a use after free, jsegdep_merge will free wk, avoid the next check...

CID: 1006098
Sponsored by: Imaginary Forces
Reviewed by: mckusick
MFC after: 1 week


# 07599ccb 21-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Fix build problem on ARM (which defaults to building without soft updates).

Reported by: Tinderbox
Sponsored by: Netflix


# 58941b9f 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Restructuring of the soft updates code to set it up so that the
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.

Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep. Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.

Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.

Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.

Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.

While there are many lines of changes, there should be no functional
change in the operation of soft updates.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# cc76ac5a 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Fourth of several cleanups to soft dependency implementation.
Add KASSERTS that soft dependency functions only get called
for filesystems running with soft dependencies. Calling these
functions when soft updates are not compiled into the system
now results in a panic.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# 519e3c3b 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# 90a306d8 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

Second of several cleanups to soft dependency implementation.
Delete two unused functions in ffs_softdep.c.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# 8850120f 20-Oct-2013 Kirk McKusick <mckusick@FreeBSD.org>

First of several cleanups to soft dependency implementation.
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# 8cf85cf2 05-Aug-2013 Kirk McKusick <mckusick@FreeBSD.org>

With the addition of journalled soft updates, the "newblk" structures
persist much longer than previously. Historically we had at most 100
entries; now the count may reach a million. With the increased count
we spent far too much time looking them up in the grossly undersized
newblk hash table. Configure the newblk hash table to accurately reflect
the number of entries that it must index.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 57591d8e 05-Aug-2013 Kirk McKusick <mckusick@FreeBSD.org>

To better understand performance problems with journalled soft updates,
we need to collect the highest level of allocation for each of the
different soft update dependency structures. This change collects these
statistics and makes them available using `sysctl debug.softdep.highuse'.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# 97371fa5 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Properly spell sentinel (missed in 250891)
No functional changes.

Spotted by: Navdeep Parhar and Alexey Dokuchaev
MFC after: 2 weeks


# b1bd9340 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Add missing buffer releases (brelse) after bread calls that return
an error. One could argue that returning a buffer even when it is
not valid is incorrect, but bread has always returned a buffer
valid or not.

Reviewed by: kib
MFC after: 2 weeks


# 21844a3d 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Add missing 28th element to softdep types name array.

Found by: Coverity Scan, CID 1007621
Reviewed by: kib
MFC after: 2 weeks


# d80dbbdb 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Null a pointer after it is freed so that when it is returned
the return value is NULL. Based on the returned flags, the
return value should never be inspected in the case where NULL
is returned, but it is good coding practice not to return a
pointer to freed memory.
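
The practice can be shown with a small user-space sketch. The type
struct item and the function release_item_sketch() are hypothetical and
only illustrate clearing the pointer before it is returned.

```c
#include <stdlib.h>

struct item {
	int value;
};

/*
 * Hypothetical routine that may consume (free) the item it was given.
 * When it does, the local pointer is cleared so the caller receives
 * NULL rather than the address of freed memory.
 */
static struct item *
release_item_sketch(struct item *it, int consume)
{
	if (consume) {
		free(it);
		it = NULL;
	}
	return (it);
}
```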

Found by: Coverity Scan, CID 1006096
Reviewed by: kib
MFC after: 2 weeks


# 64e2b088 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Remove a bogus check for a NULL buffer pointer.
Add a KASSERT that it is not NULL.

Found by: Coverity Scan, CID 1009114
Reviewed by: kib
MFC after: 2 weeks


# 13e369a7 21-May-2013 Kirk McKusick <mckusick@FreeBSD.org>

Properly spell sentinel (not sintenel or sentinal).
No functional changes.

Spotted by: kib
MFC after: 2 weeks


# 26089666 06-Apr-2013 Jeff Roberson <jeff@FreeBSD.org>

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


# cd861931 03-Apr-2013 Kirk McKusick <mckusick@FreeBSD.org>

The code in clear_remove() and clear_inodedeps() skips one entry
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.
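
The off-by-one can be illustrated by counting the buckets a scan visits
when its bound is the stored value (table size minus one). This is a
sketch with hypothetical names, not the committed change; it only shows
that an exclusive bound misses the last bucket and an inclusive bound
covers them all.

```c
#include <stddef.h>

/*
 * Count the buckets visited by a scan from 0 whose bound is hash_mask,
 * i.e. the table size minus one.
 */
static size_t
buckets_visited(size_t hash_mask, int inclusive)
{
	size_t cnt, visited;

	visited = 0;
	for (cnt = 0; inclusive ? cnt <= hash_mask : cnt < hash_mask; cnt++)
		visited++;
	return (visited);
}
```

For a table of 8 buckets the stored value is 7: the exclusive scan
visits 7 buckets and the inclusive scan visits all 8.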

It is extremely unlikely that this would cause any operational
failure. These functions only need to find a single entry and are
only called when there are too many entries. The chance that they
would fail because all the entries are on the single skipped hash
chain is remote.

Submitted by: Pedro Martelletto
Reviewed by: kib
MFC after: 2 weeks


# ba05dec5 27-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with a non-zero refcount is
encountered. The situation is possible because quotas are turned off
before the softdep workitem queue is flushed, since the quota file
writes might create softdep workitems.

Make encountering an active dquot in dqflush() non-fatal, and return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of the
softdep_flushfiles() loop, until the last iteration. On the last
iteration, the quotas must be closed, and because SU workitems should
already be flushed, the references to the dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


# ddd6b3fc 10-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# b1308d72 21-Dec-2012 Attilio Rao <attilio@FreeBSD.org>

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could bear such a change in
semantics, and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


# ad9cdc05 13-Nov-2012 Jeff Roberson <jeff@FreeBSD.org>

- Fix a truncation bug with softdep journaling that could leak blocks on
crash. When truncating a file that never made it to disk we use the
canceled allocation dependencies to hold the journal records until
the truncation completes. Previously allocdirect dependencies on
the id_bufwait list were not considered and their journal space
could expire before the bitmaps were written. Cancel them and attach
them to the freeblks as we do for other allocdirects.
- Add KTR traces that were used to debug this problem.
- When adding jsegdeps, always use jwork_insert() so we don't have more
than one segdep on a given jwork list.

Sponsored by: EMC / Isilon Storage Division


# b2c29d39 12-Nov-2012 Jeff Roberson <jeff@FreeBSD.org>

- Fix a bug that has existed since the original softdep implementation.
When a background copy of a cg is written we complete any work associated
with that bmsafemap. If new work has been added to the non-background
copy of the buffer it will be completed before the next write happens.
The solution is to do the rollbacks when we make the copy so only those
dependencies that were present at the time of writing will be completed
when the background write completes. This would've resulted in various
bitmap related corruptions and panics. It also would've expired journal
entries early causing journal replay to miss some records.

MFC after: 2 weeks


# 53cc0beb 08-Nov-2012 Jeff Roberson <jeff@FreeBSD.org>

- Correct rev 242734, segments can sometimes get stuck. Be a bit more
defensive with segment state.

Reported by: b. f. <bf1783@googlemail.com>


# 40b43503 07-Nov-2012 Jeff Roberson <jeff@FreeBSD.org>

- Implement BIO_FLUSH support around journal entries. This will not 100%
solve power loss problems with dishonest write caches. However, it
should improve the situation and force a full fsck when it is unable
to resolve with the journal.
- Resolve a case where the journal could wrap in an unsafe way causing
us to prematurely lose journal entries in very specific scenarios.

Discussed with: mckusick
MFC after: 1 month


# 6d95eb4c 02-Nov-2012 Jeff Roberson <jeff@FreeBSD.org>

- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If
the previous diradd had already finished it could have been reclaimed
already. This would only happen under heavy dependency pressure.

Reported by: Andrey Zonov <zont@FreeBSD.org>
Discussed with: mckusick
MFC after: 1 week


# f1988d46 28-Oct-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix two problems that caused instant panic when the device mounted
with softupdates went away. Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.

Reviewed by: mckusick


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change, which does
not change the interface signatures.

Conducted and reviewed by: attilio
Tested by: pho


# fc8fdae0 27-Sep-2012 Matthew D Fleming <mdf@FreeBSD.org>

Fix up kernel sources to be ready for a 64-bit ino_t.

Original code by: Gleb Kurtsou


# aa445c9d 11-Jun-2012 Kirk McKusick <mckusick@FreeBSD.org>

In softdep_setup_inomapdep() we may have to allocate both inodedep
and bmsafemap dependency structures in inodedep_lookup() and
bmsafemap_lookup() respectively. The setup of these structures must
be done while holding the soft-dependency mutex. If the inodedep is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the bmsafemap. If the bmsafemap is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the inodedep.

To resolve this problem, bmsafemap_lookup has had a parameter added
that allows a pre-malloc'ed bmsafemap to be passed in so that it does
not need to release the mutex to create a new bmsafemap. The
softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency
before acquiring the mutex and starting to build the inodedep with a
call to inodedep_lookup(). The subsequent call to bmsafemap_lookup()
is passed this pre-allocated bmsafemap entry so that it need not
release the mutex if it needs to create a new one.

Reported by: Peter Holm
Tested by: Peter Holm
MFC after: 1 week


# 8b620711 18-May-2012 Kirk McKusick <mckusick@FreeBSD.org>

Add missing `continue' statement at end of case.

Found by: Kevin Lo (kevlo@)
MFC after: 1 week


# 26621e1f 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument from clear_inodedeps() and clear_remove().


# 71469bb3 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if it is DOOMED or other possible end-of-life scenarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 23d6e518 02-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).

With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been committed to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.

This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count from
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).

While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.

Reported by: Kirk Russell
Fix submitted by: Jeff Roberson
Tested by: Peter Holm
PR: kern/159971
MFC (to 9 only): 2 weeks
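
The check described above can be sketched as a small user-space model.
The toy_inode structure and the toy_link()/toy_unlink() functions are
hypothetical names used only for illustration; the real fix lives in the
UFS link() path and involves locking that is not modeled here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of an inode: only the link count matters here. */
struct toy_inode {
	int nlink;
};

/* Drop one name; the count may legitimately reach zero. */
void
toy_unlink(struct toy_inode *ip)
{
	assert(ip->nlink > 0);
	ip->nlink--;
}

/*
 * Add one name. If the count is already zero, the file lost the
 * link()/unlink() race, so refuse rather than resurrect it.
 */
int
toy_link(struct toy_inode *ip)
{
	assert(ip->nlink >= 0);
	if (ip->nlink == 0)
		return (ENOENT);
	ip->nlink++;
	return (0);
}
```

Refusing the link keeps the count from passing through zero and back up,
which is the transition that produced the second, false attempt to add
the inode to the unreferenced-inode list.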


# 75a58389 24-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that are passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


# 064f517d 13-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Supply a boolean as the second argument to ffs_update(), and not the
MNT_[NO]WAIT constants, which in fact always caused a sync operation.

Based on the submission by: bde
Reviewed by: mckusick
MFC after: 2 weeks
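
A minimal sketch of why passing either constant forced a synchronous
update when the argument is treated as a boolean. The TOY_MNT_* values
below are assumptions chosen to mirror two distinct nonzero constants,
and toy_update_waits() is a hypothetical stand-in for the test that
ffs_update() applies to its second argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: two distinct nonzero constants. */
#define	TOY_MNT_WAIT	1
#define	TOY_MNT_NOWAIT	2

/* The callee only asks "is the argument nonzero?". */
bool
toy_update_waits(int waitfor)
{
	assert(waitfor == 0 || waitfor == TOY_MNT_WAIT ||
	    waitfor == TOY_MNT_NOWAIT);
	return (waitfor != 0);
}
```

With this test, TOY_MNT_NOWAIT is just as "true" as TOY_MNT_WAIT, so a
caller that wants an asynchronous update has to pass 0 (false).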


# 38ddb572 08-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week


# 35338e60 01-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


# e8e848ef 12-Feb-2012 Kirk McKusick <mckusick@FreeBSD.org>

Missing conditions in checking whether an inode has been written.

Found and tested by: Peter Holm
MFC after: 2 weeks (to 9 only)


# 752a98b1 06-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Add missing opt_quota.h include to activate #ifdef QUOTA blocks,
apparently a step in unbreaking QUOTA support.

Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com>
MFC after: 1 week


# b313a710 06-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

JNEWBLK dependency may legitimately appear on the buf dependency
list. If softdep_sync_buf() discovers such dependency, it should do
nothing, which is safe as it is only waiting on the parent buffer to
be written, so it can be removed.

Committed on behalf of: jeff
MFC after: 1 week


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# b296414c 20-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Use nowait sync request for a vnode when doing softdep cleanup. We possibly
own an unrelated vnode lock, so doing a waiting sync causes deadlocks.

Reported and tested by: pho
Approved by: re (bz)


# 82378711 25-Aug-2011 Martin Matuska <mm@FreeBSD.org>

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# fddf7bae 29-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)


# d716efa9 24-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


# b8ea56d7 14-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Consistently check mount flag (MNTK_SUJ) rather than superblock
flag (FS_SUJ) when determining whether to do journaling-based
operations. The mount flag is set only when journaling is active
while the superblock flag is set to indicate that journaling is to
be used. For example, when the filesystem is mounted read-only, the
journaling may be present (FS_SUJ) but not active (MNTK_SUJ).
Inappropriate checking of the FS_SUJ flag was causing some
journaling actions to be attempted at inappropriate times.
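
The distinction between the two flags can be sketched as follows. The
toy_mount structure, the TOY_* flag values and both functions are
hypothetical and exist only to illustrate the rule "configured is not
the same as active"; they are not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TOY_FS_SUJ	0x01u	/* superblock: journaling is configured */
#define	TOY_MNTK_SUJ	0x01u	/* mount: journaling is currently active */

struct toy_mount {
	unsigned int sb_flags;		/* persistent superblock flags */
	unsigned int kern_flags;	/* per-mount runtime flags */
};

/* Activate journaling only for a read-write mount that asks for it. */
void
toy_mount_fs(struct toy_mount *mp, bool rdonly)
{
	assert(mp != NULL);
	mp->kern_flags &= ~TOY_MNTK_SUJ;
	if ((mp->sb_flags & TOY_FS_SUJ) != 0 && !rdonly)
		mp->kern_flags |= TOY_MNTK_SUJ;
}

/* Journaling decisions consult the mount flag, never the superblock. */
bool
toy_do_journal(const struct toy_mount *mp)
{
	return ((mp->kern_flags & TOY_MNTK_SUJ) != 0);
}
```

A read-only mount of a journaled filesystem keeps TOY_FS_SUJ set in the
superblock while toy_do_journal() returns false, which is the case the
commit message singles out.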


# e9b4d832 04-Jul-2011 Jeff Roberson <jeff@FreeBSD.org>

- Speed up pendingblock processing again. Having too much delay between
ffs_blkfree() and the pending adjustment causes all kinds of
space related problems.


# f2803e61 04-Jul-2011 Jeff Roberson <jeff@FreeBSD.org>

- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now
find themselves on snapshot vnodes.

Reported by: pho


# 8e4f5b70 04-Jul-2011 Jeff Roberson <jeff@FreeBSD.org>

- It is impossible to run request_cleanup() while doing a copyonwrite.
This will most likely cause new block allocations which can recurse
into request cleanup.
- While here optimize the ufs locking slightly. We need only acquire and
drop once.
- process_removes() and process_truncates() are also only needed once.
- Attempt to flush each item on the worklist once but do not loop forever
if some can not be completed.

Discussed with: mckusick


# 08af0c8b 29-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Handle the FREEDEP case in softdep_sync_buf().
This fix failed to get added in -r223325.

Submitted by: Peter Holm


# 16f7d822 19-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

- Fix directory count rollbacks by passing the mode to the journal dep
earlier.
- Add rollback/forward code for frag and cluster accounting.
- Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)


# 43a3cc77 15-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by: Jeff Roberson


# e34a7135 15-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Missing cleanup case after completion of a snapshot vnode write
claiming a released block.

Submitted by: Jeff Roberson
Tested by: Peter Holm


# 9eb8728a 12-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by: Jeff Roberson
Tested by: Peter Holm


# 9420dc62 12-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.


# 280e091a 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 3d08a76b 12-May-2011 Matthew D Fleming <mdf@FreeBSD.org>

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# 273ca851 10-Apr-2011 Jeff Roberson <jeff@FreeBSD.org>

- Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
- Use dep_current[] in place of specific dependency counts. This is
automatically maintained when workitems are allocated and has
less risk of becoming incorrect.


# 4ac80906 09-Apr-2011 Jeff Roberson <jeff@FreeBSD.org>

Fix a long standing SUJ performance problem:

- Keep a hash of indirect blocks that have recently been freed and are
still referenced in the journal.
- Lookup blocks in this hash before forcing a new block write to wait on
the journal entry to hit the disk. This is only necessary to avoid
confusion between old identities as indirects and new identities as
file blocks.
- Don't free jseg structures until the journal has written a record that
invalidates it. This keeps the indirect block information around for
as long as is required to be safe.
- Force an empty journal block write when required to flush out stale
journal data that is simply waiting for the oldest valid sequence
number to advance beyond it.


# 59343c7b 06-Apr-2011 Jeff Roberson <jeff@FreeBSD.org>

- Don't invalidate jnewblks immediately upon discovering that the block
will be removed. Permit the journal to proceed so that we don't leave
a rollback in a cg for a very long time as this can cause terrible perf
problems in low memory situations.

Tested by: pho


# 4c821a39 05-Apr-2011 Kirk McKusick <mckusick@FreeBSD.org>

Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks


# f79d4144 02-Apr-2011 Jeff Roberson <jeff@FreeBSD.org>

Fix problems that manifested from filesystem full conditions:

- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
the jaddref so we can make assumptions about where the dotaddref is on
the list. cancel_jaddref() does not always remove items from the list
anymore.
- Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
was never set. This ensures that dependencies will continue to be
processed on the inowait/bufwait list and is more an artifact of
the structure of the code than a pure ordering problem.
- Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
appropriately. This normally occurs when the refs are added to the
journal but if they are canceled before this point the state would
never be set and the dependency could never be freed.

Reported by: pho
Tested by: pho


# 861ed116 27-Mar-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.

Submitted by: Aleksandr Rybalko <ray dlink ua>


# 0a809056 22-Mar-2011 Kirk McKusick <mckusick@FreeBSD.org>

Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm


# 455a6e0f 11-Feb-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. The journal
needs to write the block atomically, which can only be guaranteed at the
device sector size, not larger. An attempt to write less than the sector
size results in driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with: pho
Reviewed by: jeff
Tested by: bz, pho


# 3eb6e131 09-Feb-2011 Alexander Leidinger <netchild@FreeBSD.org>

Add some FEATURE macros for some UFS features.

SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES

Two minor changes were made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availability if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: kib
X-MFC after: to be determined in last commit with code from this project


# e7ceb1e9 07-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yield() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# 08b163fa 02-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week
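
The split between the two helpers can be sketched like this. The
toy_thread structure, the TOY_QUANTUM threshold and the bookkeeping are
assumptions made for illustration; the kernel functions use their own
tick accounting and actually switch threads, which this model does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TOY_QUANTUM	10	/* ticks before a thread counts as a hog */

struct toy_thread {
	unsigned int now;		/* current tick count */
	unsigned int last_switch;	/* tick of the last yield */
	unsigned int yields;		/* number of yields so far */
};

/* Pure predicate: has this thread run for a full quantum? */
bool
toy_should_yield(const struct toy_thread *td)
{
	assert(td != NULL);
	return (td->now - td->last_switch >= TOY_QUANTUM);
}

/* Check-and-yield: the common case wrapped in one call. */
bool
toy_maybe_yield(struct toy_thread *td)
{
	if (!toy_should_yield(td))
		return (false);
	td->last_switch = td->now;	/* stands in for the real yield */
	td->yields++;
	return (true);
}
```

A loop that previously counted iterations against a magic number would
instead call the check-and-yield helper once per iteration.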


# fbbb13f9 12-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.


# ac32f117 04-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by: jeff
Tested by: pho


# 465e3ccd 30-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Handle missing jremrefs when a directory is renamed overtop of
another, deleting it. If the directory is removed, UFS always needs to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by: many
Submitted by: jeff
Tested by: pho


# 42a6fc43 30-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with: pho
Reviewed by: jeff, mckusick
Tested by: mckusick, pho


# d2d6c592 28-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Move the definition of mkdirlisthd from header to C file.

Reviewed by: mckusick
Tested by: pho


# 84ad0a66 22-Dec-2010 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a soft update panic while running perl 5.12 tests
which produced:

panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson


# bcc5c95b 27-Nov-2010 Peter Holm <pho@FreeBSD.org>

First step in fixing the handle_workitem_freeblocks panic.

In collaboration with: kib


# be913821 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
A write of the indirect block causes the freeblks to become
ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
then, softdep_disk_write_complete() would reassign the workitem to the
mount point worklist, causing premature processing of the workitem, or
journal write exhausts the fb_jfreeblkhd and handle_written_jfreeblk does
the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by: jeff
Tested by: pho


# 496fd813 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by: jeff
Tested by: pho


# d23c72cd 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by: jeff
Tested by: pho


# fae5c47d 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by: jeff
Tested by: pho


# 4e4ff016 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix typo. Function is called ffs_blkfree.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# a03e344a 02-Oct-2010 Alan Cox <alc@FreeBSD.org>

M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.


# e69bed36 29-Sep-2010 Kirk McKusick <mckusick@FreeBSD.org>

Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
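
A minimal sketch of the pattern, assuming a hypothetical TOY_KASSERT
macro that compiles to nothing unless INVARIANTS is defined (the real
KASSERT takes a message argument and panics). The function and its
arguments are invented for illustration.

```c
#include <assert.h>

#ifdef INVARIANTS
#define	TOY_KASSERT(cond)	assert(cond)
#else
#define	TOY_KASSERT(cond)	((void)0)
#endif

int
toy_first_slot(const int *slots, int nslots)
{
#ifdef INVARIANTS
	int i = nslots - 1;	/* consumed only by the assertion below */
#endif

	TOY_KASSERT(i >= 0);
	return (slots[0]);
}
```

Without the #ifdef around the declaration, a build without INVARIANTS
would see a variable that is set but never read and would warn about it.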


# 063045a5 29-Sep-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix typo in comment.


# c0b2efce 14-Sep-2010 Kirk McKusick <mckusick@FreeBSD.org>

Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by: Jeff Roberson


# 3634d5b2 20-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 691401ee 12-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff


# 9f9c8c59 06-Jul-2010 Jeff Roberson <jeff@FreeBSD.org>

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal fills. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


# d89c217f 11-Jun-2010 Andriy Gapon <avg@FreeBSD.org>

ffs_softdep: change K&R in function definitions to ANSI prototypes

Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its definition.

Complaint from: clang
MFC after: 2 weeks


# f0268739 19-May-2010 Jeff Roberson <jeff@FreeBSD.org>

- Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration. This can lead to a deadlock when we have
worklist items that cannot be immediately satisfied.

Reported by: uqs, Dimitry Andric <dimitry@andric.com>

- Remove some unnecessary debugging code and place some other under
SUJ_DEBUG.
- Examine the journal state in softdep_slowdown().
- Re-format some comments so I may more easily add flag descriptions.


# 8ef48de8 07-May-2010 Jeff Roberson <jeff@FreeBSD.org>

- Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
- Don't fsync() vnodes in prealloc if copy on write is in progress. It
is not safe to recurse back into the write path here.

Reported by: Vladimir Grebenschikov <vova@fbsd.ru>


# 2c3ae115 07-May-2010 Jeff Roberson <jeff@FreeBSD.org>

- Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not. This fixes
a deadlock that can occur with unlinked but referenced files.
Journal space and inodedeps were not correctly reclaimed because
the inode block was not left dirty.

Tested/Reported by: lwindschuh@googlemail.com


# 2bd20091 28-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- When canceling jaddrefs they may not yet be in the journal if this is via
a revert call. In this case don't attempt to remove something that
has not yet been added. Otherwise this jaddref must hang around
to prevent the bitmap write as normal.


# 3b32573a 28-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Fix builds without SOFTUPDATES defined in the kernel config.


# a8750f2d 24-Apr-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix build for UFS without SOFTUPDATES.


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 4dec4ece 14-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196888:
The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Approved by: re (kensmith)


# 5c61c646 06-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by: tegge
Tested by: pho
MFC after: 1 week


# 8c5a7882 14-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196206:
Take the number of allocated freeblks into consideration for
softdep_slowdown(), to prevent kernel memory exhaustion on
mass-truncation.

Approved by: re (rwatson)


# 165a3b41 14-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done synchronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by: pluknet gmail com
Diagnosed and tested by: pho
Approved by: re (rwatson)
MFC after: 1 month
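
The idea can be sketched as a threshold test that includes the freeblks
counter. The structure, the field names and the "half of the limit"
thresholds are placeholders chosen for illustration; they are not the
values or the exact conditions used by softdep_slowdown().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of outstanding dependency counts. */
struct toy_dep_counts {
	long inodedeps;
	long dirrems;
	long freeblks;
};

/*
 * Return true when callers should fall back to synchronous work.
 * The thresholds are placeholders, not the kernel's.
 */
bool
toy_slowdown(const struct toy_dep_counts *c, long max_deps)
{
	assert(max_deps > 0);
	return (c->inodedeps > max_deps / 2 ||
	    c->dirrems > max_deps / 2 ||
	    c->freeblks > max_deps / 2);	/* newly considered counter */
}
```

Without the last term, a workload that only truncates files never trips
the test, so freeblks structures accumulate without bound.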


# f1eccd05 02-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drops vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks


# a50d1b2a 30-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by: pho
Approved by: re (kensmith)
MFC after: 1 month


# f0830182 02-Jun-2009 Attilio Rao <attilio@FreeBSD.org>

Handle lock recursion differently by always checking against LO_RECURSABLE
instead of the lock's own flag itself.

Tested by: pho


# bc364c4e 03-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

When removing or renaming a snapshot, do not delve into request_cleanup().
The latter may need blocks from the underlying device that belong
to normal files, which should not be locked while snap lock is held.

Reported and tested by: pho
MFC after: 1 month


# 83b3bdbc 02-Nov-2008 Attilio Rao <attilio@FreeBSD.org>

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change its KPI accordingly by removing the interlock argument and defining
2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr already) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used in vfs_mount_destroy(), which allows overriding the
mnt_ref if running for more than 3 seconds, makes it totally useless.
Remove it as it was intended to work with older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix it properly rather than trust such a hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# e11e3f18 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 52dfc8d7 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a command to dump the dirty/clean buffer list for vnode.

Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month


# 0411d791 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

The struct inode *ip supplied to softdep_freefile is not necessarily the
inode having number ino. In r170991, the ip was marked IN_MODIFIED, which
is not quite correct.

Mark only the right inode modified by checking inode number.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 59d49325 31-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


# 7b7ed832 28-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to override the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by: pho, kris
Reviewed by: tegge
MFC after: 1 month


# 047dd67e 06-Apr-2008 Attilio Rao <attilio@FreeBSD.org>

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) calls.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
already do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the corresponding per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS for now, but it will probably be
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wake up, are in the runqueue they can contend for the lock with
the exclusive drainer. This is hard to fix, but the now committed
code mitigates this issue a lot better than the (past) CVS version.
In addition, the KA_HELD and KA_UNHELD assertions have been made mute
because they are dangerous and they will no longer be supported
soon.

In order to avoid namespace pollution, stack.h is split into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.

The kernel ABI is heavily changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
the KPI is broken by the lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


# 1be222e9 23-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks


# 698b1a66 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 6c62df7e 13-Mar-2008 Coleman Kane <cokane@FreeBSD.org>

Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE
callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous
items on the http://wiki.freebsd.org/SMPTODO list.

Reviewed by: imp, obrien, jhb
MFC after: 1 week


# 3eb8098d 10-Mar-2008 Ed Maste <emaste@FreeBSD.org>

Remove include of opt_quota.h; as of revision 1.205 there is no longer
any #ifdef QUOTA conditional code.


# 628f51d2 24-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument that is always curthread.
Remove the unneeded extra argument and explicitly pass curthread to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
the version bump and manpage update will be committed separately.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with 'curthread' passed as argument.
Remove this argument and pass curthread directly to the underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI, obviously, changes as a result.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, it is worth saying that upcoming commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 5b4ab4a0 09-Nov-2007 Ruslan Ermilov <ru@FreeBSD.org>

Fix build without INVARIANTS and update a comment to match
a change made in previous revision.


# 1102b89b 08-Nov-2007 David E. O'Brien <obrien@FreeBSD.org>

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 3745c395 20-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# d66ba370 22-Jun-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix livelock that could occur when snapshotting UFS with quotas, where
some quota limit was exceeded. The sequence of UFS_VALLOC()/UFS_VFREE()
calls there could cause an inodeblock to have both freefile and inodedep
dependencies without any inode in the block being marked for write.
Then, softdep_check_suspend() would return EAGAIN forever.

Force write of inodeblock with allocated freefile softdependency by
setting IN_MODIFIED flag in softdep_freefile and unconditionally calling
UFS_UPDATE() in ufs_reclaim.

Reported by: kris
Debug help and tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks


# 832eef31 03-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Add a newline to the printf message.


# 9724167c 10-Apr-2007 Konstantin Belousov <kib@FreeBSD.org>

Recalculate the NEWBLOCK flag for pagedep structure after the softdep
lock is dropped, since pagedep may be already processed and deallocated.

Found and tested by: kris
MFC after: 2 weeks


# 23743f6a 10-Apr-2007 Konstantin Belousov <kib@FreeBSD.org>

When LK_NOWAIT is passed as argument to process_worklist_item(), this
does not prevent handle_workitem_remove() from recursing into a blocking
version. Add the dirrem to worklist instead of processing it now if this
is the case.

Reported and tested by: kris
Submitted by: tegge
MFC after: 2 weeks


# 04533fc6 04-Apr-2007 Xin LI <delphij@FreeBSD.org>

Use *_EMPTY macros when appropriate.


# 06f0c8dc 29-Mar-2007 Konstantin Belousov <kib@FreeBSD.org>

Revert rev. 1.205. Replace unconditional acquisition of Giant when QUOTAS are
defined with VFS_LOCK_GIANT(NULL) call.
This shall fix softdep operation when mpsafe_vfs = 0.

Reported and tested by: kris
Submitted by: tegge
MFC after: 1 week


# 36d46679 20-Mar-2007 Konstantin Belousov <kib@FreeBSD.org>

Mark UFS as being MP-Safe in "options QUOTA" case too. Remove the no longer
necessary Giant acquisitions in the softdep processing code.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


# 98fff6b5 23-Feb-2007 Brian Somers <brian@FreeBSD.org>

Account for di_blocks allocations when IN_SPACECOUNTED is set in an
inode's i_flag.

It's possible that after ufs_inactive() calls softdep_releasefile(),
i_nlink stays >0 for a considerable amount of time (> 60 seconds here).
During this period, any ffs allocation routines that alter di_blocks
must also account for the blocks in the filesystem's fs_pendingblocks
value.

This change fixes an eventual df/du discrepancy that will happen as
the result of fs_pendingblocks being reduced to <0.

The only manifestation of this that people may recognise is the
following message on boot:

/somefs: update error: blocks -N files M

at which point the negative pending block count is adjusted to zero.

Reviewed by: tegge
MFC after: 3 weeks


# 2276d081 01-Nov-2006 Konstantin Belousov <kib@FreeBSD.org>

Acquire Giant in the softdep_flush for clear_remove() and clear_inodedeps()
processing when QUOTA is set.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 3 days


# e60c3612 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Reduce fluctuations of mnt_flag to allow unlocked readers to get a
slightly more consistent view.


# 55b4ff0d 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Increase mnt_noasync once in softdep_mount() to disallow async io,
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.


# 5da56ddb 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# 28de2218 20-Sep-2006 Konstantin Belousov <kib@FreeBSD.org>

Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(),
the switch on worklist type contains two for() loops, for D_INDIRDEP and
D_PAGEDEP. On error, these loops are exited by break, whereas the switch
itself should be left. Use goto instead of break to reach the error
handling code.

Reported by: Peter Holm
Reviewed by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks
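
The break-versus-goto distinction in this entry can be illustrated with a small userspace sketch. This is not the code from softdep_sync_metadata(); the names sync_items(), flush_one(), and the ITEM_* types are hypothetical stand-ins. It only shows that a break inside a for() loop nested in a switch leaves the loop, not the switch, so a goto is needed to reach the error exit directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item types standing in for worklist types. */
enum item_type { ITEM_SIMPLE, ITEM_LIST };

/* Hypothetical per-entry operation: negative values fail with error 5. */
static int
flush_one(int v)
{
	return (v < 0 ? 5 : 0);
}

/*
 * Returns 0 on success or the first error. *tail_ran is incremented only
 * when the code following the switch runs, i.e. when no error occurred.
 */
static int
sync_items(enum item_type type, const int *vals, size_t n, int *tail_ran)
{
	int error = 0;
	size_t i;

	assert(vals != NULL && tail_ran != NULL);
	switch (type) {
	case ITEM_LIST:
		for (i = 0; i < n; i++) {
			error = flush_one(vals[i]);
			if (error != 0)
				goto out;	/* a plain break would only leave the for loop */
		}
		break;
	case ITEM_SIMPLE:
		error = (n > 0) ? flush_one(vals[0]) : 0;
		if (error != 0)
			goto out;
		break;
	}
	/* Work that must be skipped when an error occurred. */
	(*tail_ran)++;
out:
	return (error);
}
```

Calling sync_items(ITEM_LIST, ...) on an array containing a negative value returns the error without running the code after the switch.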


# e45269bf 16-May-2006 Tom Rhodes <trhodes@FreeBSD.org>

Provide a less cryptic panic message in place of just "found inode."


# 43e07fff 06-May-2006 Tor Egge <tegge@FreeBSD.org>

ffs_syncvnode() might skip some of the blocks due to them being locked,
assuming them to be inflight write buffers. This is not always the case.
bufdaemon might hold the buffer lock and give up writing the buffer due to it
having dependencies, the file system being suspended or the vnode lock being
held by another thread. When bufdaemon decides to write the buffer there is
still a window before bufobj_wref() has been called, allowing other threads to
believe that the vnode has no dirty buffers or inflight writes.

Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY
dependency.


# 39fac379 17-Apr-2006 Ken Smith <kensmith@FreeBSD.org>

Fix panic() message to give the right function name.


# 68e84666 03-Apr-2006 Tor Egge <tegge@FreeBSD.org>

Eliminate softdep_flush() livelock by accounting for number of worklist items
marked as being in progress.


# 2eedeb7e 11-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.

Pointed out by: tegge


# 791dd2fa 08-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# b9b12498 02-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Acquire lk in softdep_slowdown so that it's owned when we call
softdep_speedup().
- Assert that lk is held in softdep_speedup() rather than acquiring it.
This avoids a potential lock recursion.


# eb2ea105 01-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 6c62b2ac 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

If the lock passed to getdirtybuf() is the softdep lock then the background
write completed wakeup could be missed. Close the race by grabbing the lock
normally used for protection of bp->b_xflags.

Reviewed by: truckman


# c8c7711d 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Broaden scope of softdep_worklist_busy rwlock protection of softdep processing
to avoid some dependencies being missed by softdep_flushworklist().

Reviewed by: truckman


# cd34c8b6 23-Dec-2005 Xin LI <delphij@FreeBSD.org>

Typo.


# 445193b8 29-Sep-2005 Don Lewis <truckman@FreeBSD.org>

After a rmdir()ed directory has been truncated, force an update of
the directory's inode after queuing the dirrem that will decrement
the parent directory's link count. This will force the update of
the parent directory's actual link count to actually be scheduled. Without
this change the parent directory's actual link count would not be
updated until ufs_inactive() cleared the inode of the newly removed
directory, which might be deferred indefinitely. ufs_inactive()
will not be called as long as any process holds a reference to the
removed directory, and ufs_inactive() will not clear the inode if
the link count is non-zero, which could be the result of an earlier
system crash.

If a background fsck is run before the update of the parent directory's
actual link count has been performed, or at least scheduled by
putting the dirrem on the leaf directory's inodedep id_bufwait list,
fsck will corrupt the file system by decrementing the parent
directory's effective link count, which was previously correct
because it already took the removal of the leaf directory into
account, and setting the actual link count to the same value as the
effective link count after the dangling, removed, leaf directory
has been removed. This happens because fsck acts based on the
actual link count, which will be too high when fsck creates the
file system snapshot that it references.

This change has the fortunate side effect of more quickly cleaning
up the large number of dirrem structures that linger for an extended
time after the removal of a large directory tree. It also fixes a
potential problem with the shutdown of the syncer thread timing out
if the system is rebooted immediately after removing a large directory
tree.

Submitted by: tegge
MFC after: 3 days


# d536ff2e 05-Sep-2005 Tor Egge <tegge@FreeBSD.org>

Retain generation count when writing zeroes instead of an inode to disk.

Don't free a struct inodedep if another process is allocating saved inode
memory for the same struct inodedep in initiate_write_inodeblock_ufs[12]().

Handle disappearing dependencies in softdep_disk_io_initiation().

Reviewed by: mckusick


# 15da51f7 21-Aug-2005 Tor Egge <tegge@FreeBSD.org>

Don't set the COMPLETE flag in an inodedep structure before the related
inode has been written.


# ed8938e0 31-Jul-2005 Stephan Uphoff <ups@FreeBSD.org>

Delay freeing disk space for file system blocks until all dirty buffers
are safely released. This fixes softdep problems on truncation (deletion)
of files with dirty buffers.

Reviewed by: jeff@, mckusick@, ps@, tegge@
Tested by: glebius@, ps@
MFC after: 3 weeks


# 6c71a220 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't restrict the softdep stats to DEBUG kernels, they cost nothing to
export. This was happening anyway since this file manually sets DEBUG.
- Add a sysctl for the number of items on the worklist.
- Use a more canonical loop restart in softdep_fsync_mountdev, it saves
some code at the expense of a goto and makes me worry less about
modifying a variable that should be private to the TAILQ_FOREACH_SAFE
macro.


# 153910e0 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix
two bugs.
- ffs_disk_prewrite was pulling the vp from the buf and checking for
COPYONWRITE, when really it wanted the vp from the bufobj that we're
writing to, which is the devvp. This led to us skipping the copy on
write to all file data, which significantly broke snapshots for the
last few months.
- When the SOFTUPDATES option was not included in the kernel config we
would also skip the copy on write check, which would effectively disable
snapshots.
- Remove an invalid mp_fixme().

Debugging tips from: mckusick
Reported by: iedowse, others
Discussed with: phk


# 188f6433 25-Mar-2005 David Schultz <das@FreeBSD.org>

When the softupdates worklist gets too long, threads that attempt to
add more work are forced to process two worklist items first.
However, processing an item may generate additional work, causing the
unlucky thread to recursively process the worklist. Add a per-thread
flag to detect this situation and avoid the recursion. This should
fix the stack overflows that could occur while removing large
directory trees.

Tested by: kris
Reviewed by: mckusick
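
The recursion guard described in this entry can be sketched in userspace C as follows. This is a simplified model, not the softdep code: the counters, the MAX_PENDING threshold, and the function names add_work() and process_one() are all hypothetical, and a C11 _Thread_local variable stands in for the per-thread flag. It shows a thread that is forced to process two items when the list is long, and how the flag stops the work generated by those items from triggering nested processing.

```c
#include <assert.h>

#define MAX_PENDING 4		/* hypothetical "worklist too long" threshold */

static int pending;		/* items on the simulated worklist */
static int followups = 3;	/* how many processed items spawn new work */
static int depth, max_depth;	/* nesting bookkeeping for the demo */
static _Thread_local int in_process;	/* per-thread "already processing" flag */

static void add_work(void);

/* Process one item; some items generate additional work. */
static void
process_one(void)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	pending--;
	if (followups > 0) {
		followups--;
		add_work();	/* would recurse without the flag check */
	}
	depth--;
}

/* Add an item; if the list is too long, process two items first. */
static void
add_work(void)
{
	pending++;
	if (pending <= MAX_PENDING || in_process)
		return;		/* short list, or this thread is already processing */
	in_process = 1;
	for (int i = 0; i < 2 && pending > 0; i++)
		process_one();
	in_process = 0;
}
```

With the flag in place, the nesting depth of process_one() never exceeds one, however much follow-up work the processed items generate.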


# fdcc8227 12-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


# 1a4a9672 22-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add VOP locking asserts in several functions that have been implicated in
recent deadlocks.


# a16baf37 20-Feb-2005 Xin LI <delphij@FreeBSD.org>

The recomputation of file system summary at mount time can be a
very slow process, especially for large file systems that have just
been recovered from a crash.

Since the summary is already re-sync'ed every 30 seconds, we will
not lag behind too much after a crash. With this consideration
in mind, it is more reasonable to transfer the responsibility to
background fsck, to reduce the delay after a crash.

Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to
control this behavior. When set to nonzero, we will get the
"old" behavior, that the summary is computed immediately at mount
time.

Add five new sysctl variables to adjust ndir, nbfree, nifree,
nffree and numclusters respectively. Teach fsck_ffs about these
APIs; however, it intentionally does not check for their existence, since
kernels without these sysctls must have recomputed the summary
and hence no adjustments are necessary.

This change has eliminated the usual tens of minutes of delay of
mounting large dirty volumes.

Reviewed by: mckusick
MFC After: 1 week


# 1121c394 11-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make non-SOFTUPDATES kernels compile again.

Integrate the stubfile into the main file now that license issues have been
long resolved.


# 365b18aa 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

style polishing.


# 44787ceb 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Use ffs_truncate() directly instead of UFS_TRUNCATE()


# dd19a799 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Background writes are entirely an FFS/Softupdates thing.

Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.


# 88e5b12a 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


# efd6d980 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use the UFS_* and VFS_* functions where a direct call is possible.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 40854ff5 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandoned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 9087d86e 02-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Use a separate malloc tag for saved inode contents to help in debugging
memory modified after free errors.

Sponsored by: Isilon Systems, Inc.


# 08023360 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Convert the global LK lock to a mutex.
- Expand the scope of lk to cover not only interrupt races, but also
top-half races, which includes many new uses over global top-half
only data.
- Get rid of interlocked_sleep() and use msleep or BUF_LOCK where
appropriate.
- Use the lk mutex in place of the various hand rolled semaphores.
- Stop dropping the lk lock before we panic.
- Fix getdirtybuf() callers so that they reacquire access to whatever
softdep datastructure they were inspecting in the failure/retry
case. Previously, sleeps in getdirtybuf() could leave us with
pointers to bad memory.
- Update handling of ffs to be compatible with ffs locking changes.

Sponsored By: Isilon Systems, Inc.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 43920011 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchronously consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


# 93d244fb 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

KASSERT that we only get to prewrite() on writes.


# 6e77a041 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The island council met and voted buf_prewrite() home.

Give ffs its own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.


# fae974f1 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Degeneralize the per cdev copyonwrite callback. The only possible value
is ffs_copyonwrite() and the only place it can be called from is FFS which
would never want to call another filesystem's copyonwrite method, should one
exist, so there is no reason why anything generic should know about this.


# 156cb265 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Lose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# 494eb176 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# a76d8f4e 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the VI_BWAIT flag into a new bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions in all relevant places except in ffs_softdep.c where
the use of interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 7ac439fe 08-Aug-2004 Poul-Henning Kamp <phk@FreeBSD.org>

use bufdone() not biodone().


# b403319b 28-Jul-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP to read values and DIP_SET
to modify them throughout FFS code base.
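
A minimal sketch of the accessor pattern this entry describes, assuming simplified structures: the struct layouts, the i_is_ufs1 selector, and the macro bodies below are illustrative assumptions, not the definitions from the FFS headers. The point is that reads go through an expression macro while writes go through a statement macro that assigns to the real field, so no cast is ever used as an lvalue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced on-disk inode layouts. */
struct dinode1 { int32_t di_size; };
struct dinode2 { int64_t di_size; };

/* Hypothetical in-core inode that points at one of the two layouts. */
struct inode {
	int		 i_is_ufs1;	/* nonzero when i_din1 is the valid one */
	struct dinode1	*i_din1;
	struct dinode2	*i_din2;
};

/* Read a field: an rvalue expression, widened to the larger type. */
#define DIP(ip, field)							\
	((ip)->i_is_ufs1 ?						\
	    (int64_t)(ip)->i_din1->d##field : (ip)->i_din2->d##field)

/* Write a field: assigns to the real member, no cast on the left side. */
#define DIP_SET(ip, field, val) do {					\
	if ((ip)->i_is_ufs1)						\
		(ip)->i_din1->d##field = (val);				\
	else								\
		(ip)->i_din2->d##field = (val);				\
} while (0)
```

For example, DIP_SET(ip, i_size, 512) stores into di_size of whichever layout the inode uses, and DIP(ip, i_size) reads it back.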


# f65de26b 10-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Update for the KDB debugger framework:
o Make debugging code conditional upon KDB.
o Use kdb_backtrace() instead of backtrace().
o Remove inclusion of opt_ddb.h.


# 255ec151 06-Apr-2004 John Baldwin <jhb@FreeBSD.org>

Fix a paste-o from the buf_prewrite() cleanup commit and check for the
MNTK_SUSPEND flag on the correct vnode pointer in softdep_disk_prewrite().

Reviewed by: phk
Tested by: kensmith


# ceb58ca5 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

When I was a kid my work table was one cluttered mess and cleaning it up
was a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.

Apply the same principle to the ffs/snapshot/softupdates code which has
leaked into specfs: add yet another buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.

It's not pretty, but at least it's one less layer violated.


# 4d453ef1 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 546a1660 22-Feb-2004 Kirk McKusick <mckusick@FreeBSD.org>

In the function clear_inodedeps(), a FREE_LOCK() should be called
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.

Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>


# 787f162d 23-Oct-2003 John Baldwin <jhb@FreeBSD.org>

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


# a844eb93 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- My last commit to this file is still not safe, I believe that it may be
due to the recursion in indir_trunc().


# 8af6a570 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Reinstate 1.142; this was fixed by 1.144.


# cac3558d 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- The VI assert in getdirtybuf() is only valid if we're not on a VCHR
vnode. VCHR vnodes don't do background writes.

Reported by: kan


# 04c81ad8 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove a mp_fixme() and some locks that weren't necessary. I now
understand how this works.


# cfd5600c 02-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Several of the callers to getdirtybuf() were erroneously changed to pass
in a list head instead of a pointer to the first element at the time of
the first call. These lists are subject to change, and getdirtybuf()
would refetch from the wrong list in some cases.

Spotted by: tegge
Pointy hat to: me


# 23efe6da 31-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Backout rev 1.142. This caused a deadlock that I do not understand. More
investigation is required.


# d919a11d 31-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.


# a0ebaadd 31-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't acquire the vnode interlock in drain_output(). Instead, require the
caller to acquire it. This permits drain_output() to be done atomically
with other operations as well as reducing the number of lock operations.
- Assert that the proper locks are held in drain_output().
- Change getdirtybuf() to accept a mutex as an argument. This mutex is used
to protect the vnode's buf list and the BKGRDWAIT flag. This lock is
dropped when we successfully acquire a buffer and held on return
otherwise. These semantics reduce the number of cumbersome cases in
calling code.
- Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this
mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF
case of interlocked_sleep().
- Change the return value of getdirtybuf() to be the resulting locked buffer
or NULL otherwise. This is for callers who pass in a list head that
requires a lock. It is necessary since the lock that protects the list
head must be dropped in getdirtybuf() so that we don't have a lock order
reversal with the buf queues lock in bremfree().
- Adjust all callers of getdirtybuf() to match the new semantics.
- Add a comment in indir_trunc() that points at unlocked access to a buf.
This may also be one of the last instances of incore() in the tree.


# 9dbfeb0a 28-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock, a background
write will not be started without our knowledge; one may only be
completed while we're not looking. Rather than remove the code, document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a nested include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 7261f5f6 03-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
where we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviewed by: arch
Not objected to by: mckusick


# 521f364b 02-Mar-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 17661e5a 24-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# 3bf0ed94 24-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

When removing the last item from a non-empty worklist, the worklist
tail pointer must be updated.
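
The tail-pointer rule can be sketched as follows. This is a hypothetical
singly linked list with a cached tail, not the worklist code from
ffs_softdep.c; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked worklist with a cached tail pointer. */
struct workitem {
	struct workitem *next;
	int id;
};

struct worklist {
	struct workitem *head;
	struct workitem *tail;	/* last item, or NULL when the list is empty */
};

/* Append an item at the tail in constant time. */
static void
worklist_append(struct worklist *wl, struct workitem *wk)
{
	wk->next = NULL;
	if (wl->tail == NULL)
		wl->head = wk;
	else
		wl->tail->next = wk;
	wl->tail = wk;
}

/* Unlink wk; prev is its predecessor, or NULL if wk is the head. */
static void
worklist_remove(struct worklist *wl, struct workitem *prev, struct workitem *wk)
{
	assert(prev == NULL ? wl->head == wk : prev->next == wk);
	if (prev == NULL)
		wl->head = wk->next;
	else
		prev->next = wk->next;
	/* Removing the last item must move the tail back to its predecessor. */
	if (wl->tail == wk)
		wl->tail = prev;
	wk->next = NULL;
}
```

Without the tail update in worklist_remove(), the next append would link the
new item behind an entry that is no longer on the list.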

Reported by: Kris Kennaway <kris@obsecurity.org>
Sponsored by: DARPA & NAI Labs.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# fa06a012 07-Jan-2003 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a problem caused by applications that rapidly and
repeatedly truncate the same file. Each time the file is truncated,
a buffer is grabbed to store the indirect block numbers that need
to be freed. Those blocks cannot be freed until the inode claiming
them is written to disk. Thus, the number of buffers being held by
soft updates explodes and in extreme cases can run the kernel out
of buffers. The problem can be avoided by doing an fsync on the
file every debug.maxindirdep truncates (currently defaulted to 50).
The fsync causes the inode to be written so that the held buffers
can be freed. The check for excessive buffers is checked as part
of the existing hook for excessive dependencies (softdep_slowdown)
in the truncate code.
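
A rough sketch of the throttle described above, assuming a simple counter of
truncates. How the real code keeps its count is not stated here; the
structure and function names are hypothetical, and only the default of 50
comes from the commit message.

```c
/* Hypothetical bookkeeping; the names are illustrative only. */
struct trunc_state {
	int pending_truncs;	/* truncates since the inode was last written */
};

static int max_indir_dep = 50;	/* mirrors the debug.maxindirdep default */

/*
 * Record one truncate.  Returns 1 when the caller should fsync the file
 * so that the buffers held for indirect blocks can be released.
 */
static int
note_truncate(struct trunc_state *ts)
{
	if (++ts->pending_truncs < max_indir_dep)
		return (0);
	ts->pending_truncs = 0;	/* the fsync writes the inode, draining the backlog */
	return (1);
}
```

The point of the threshold is to bound the number of held buffers: at most
max_indir_dep truncates can accumulate before the inode is forced to disk.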

Reported by: David Schultz <dschultz@uclink.Berkeley.EDU>
Sponsored by: DARPA & NAI Labs.
MFC after: 3 weeks


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 120a6d84 17-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused lockcnt variable.

Approved by: mckusick


# f5235f70 19-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

The target for the maximum number of dependencies has been cut
in half because of reports that under heavy load the kernel could
exhaust its memory pool. The limit is now (desiredvnodes * 4)
rather than (desiredvnodes * 8), so it will still scale with
larger systems, just not as quickly.

Sponsored by: DARPA & NAI Labs.


# 3374bb5a 19-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

If an error occurs while writing a buffer, then the data will
not have hit the disk and the dependencies cannot be unrolled.
In this case, the system will mark the buffer as dirty again so
that the write can be retried in the future. When the write
succeeds or the system gives up on the buffer and marks it as
invalid (B_INVAL), the dependencies will be cleared.

Sponsored by: DARPA & NAI Labs.


# c0762674 23-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

We must be careful to avoid recursive copy-on-write faults when
trying to clean up during disk-full scenarios.

Sponsored by: DARPA & NAI Labs.


# 2eff16f0 22-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Misplaced FREE_LOCK causes a panic when hit while taking a snapshot.

Sponsored by: DARPA & NAI Labs.


# 85de3147 28-Sep-2002 Juli Mallett <jmallett@FreeBSD.org>

When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough
to include a newline.

MFC after: 4 days
Sponsored by: Bright Path Solutions


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 2ee5711e 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 382f95d3 19-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Fix a warning:
ffs_softdep.c:1630: warning: int format, different type arg (arg 2)


# 7aca6291 19-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags to the high sixteen bits. This works because the high sixteen
bits of the IO_ word are reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

	if (ioflags & IO_SYNC)
		flags |= BA_SYNC;
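
A minimal sketch of the shared flags word, for illustration only. The
numeric values, the two mask names, and alloc_flags() are assumptions; the
real definitions live in the kernel headers.

```c
#include <stdint.h>

/* Illustrative values only; the real definitions live in the kernel headers. */
#define IO_SYNC		0x0080u		/* assumed to sit in the low 16 bits */
#define IO_EXT		0x0400u
#define IO_NORMAL	0x0800u
#define BA_CLRBUF	0x00010000u	/* assumed to sit in the high 16 bits */

#define IO_FLAG_MASK	0x0000ffffu	/* hypothetical name */
#define BA_FLAG_MASK	0xffff0000u	/* hypothetical name */

/* Both families share one word, so IO_ flags pass through untranslated. */
static uint32_t
alloc_flags(uint32_t ioflags, int clear_buffer)
{
	uint32_t flags = ioflags & IO_FLAG_MASK;

	if (clear_buffer)
		flags |= BA_CLRBUF;
	return (flags);
}
```

Because the two ranges never overlap, a callee can test IO_SYNC and BA_CLRBUF
on the same word, and no per-flag translation of the kind shown above is
needed.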

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, let's call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
attribute storage.

Sponsored by: DARPA & NAI Labs.


# 6bd521df 01-Jul-2002 Ian Dowse <iedowse@FreeBSD.org>

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 5346934f 01-Jul-2002 Ian Dowse <iedowse@FreeBSD.org>

Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by: mckusick


# cfbf0a46 23-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by: mckusick


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent-like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent-like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 8fdbc99b 17-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix ufs_daddr_t/daddr_t type problems.

Sponsored by: DARPA & NAI labs.


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 5dacf954 14-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Don't peek into the malloc_type structure for limits. The desired vnodes
check should be sufficient. This is required for the pending removal of
malloc_type limits.


# 6f1e8551 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a0595d02 16-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems, but
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


# 0d2af521 15-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# f0c8652e 14-Mar-2002 David E. O'Brien <obrien@FreeBSD.org>

Quiet a warning on the Alpha.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# cfdaa886 06-Feb-2002 Kirk McKusick <mckusick@FreeBSD.org>

Occasionally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by: Bill Fenner <fenner@research.att.com>


# c9f96392 01-Feb-2002 Kirk McKusick <mckusick@FreeBSD.org>

When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.


# 03a2057a 21-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.


# 0bc7a833 12-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by: Debbie Chu <dchu@juniper.net>


# 794ef347 11-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week


# b9a4338d 08-Jan-2002 Mike Smith <msmith@FreeBSD.org>

Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by: mckusick


# eb46fac5 27-Sep-2001 John Baldwin <jhb@FreeBSD.org>

- Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
little nit that caused it to be defined as -(sizeof (struct thread) + 1)
instead of -2.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# eb87cd75 13-Jun-2001 Kirk McKusick <mckusick@FreeBSD.org>

Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.


# 12396742 04-Jun-2001 David E. O'Brien <obrien@FreeBSD.org>

There seems to be a problem with the order of disk write operations being
incorrect due to a missing check for some dependency. This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the underlying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by: Tor.Egge@fast.no


# dc01275b 19-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pendinghd list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.


# 9f5192ff 18-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

Must be a bit less aggressive about freeing pagedep structures.

Obtained from: Robert Watson <rwatson@FreeBSD.org> and
Matthew Jacob <mjacob@feral.com>


# 24a83a4b 17-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written. When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->pd_pendinghd
and inodedep->id_pendinghd lists.


# 9ccb939e 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.


# 0c6fbff0 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.


# 3c7a8027 01-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove blatantly pointless call to VOP_BMAP().

Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 812b1d41 20-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Add kernel support for running fsck on active filesystems.


# 8775e64a 01-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Free lock before returning from process_worklist_item.

Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>


# a5a94e39 23-Feb-2001 Kirk McKusick <mckusick@FreeBSD.org>

Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.


# cc686e21 22-Feb-2001 Kirk McKusick <mckusick@FreeBSD.org>

When cleaning up excess inode dependencies, check for being done.

Reviewed by: Jan Koum <jkb@yahoo-inc.com>


# 2cf5d587 20-Feb-2001 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistencies associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second, observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

softdep_setup_inomapdep
softdep_setup_allocdirect
softdep_setup_remove
softdep_setup_freeblocks
softdep_setup_directory_change
softdep_setup_directory_add
softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This led to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.
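
The second problem can be illustrated with a small lookup-or-allocate sketch.
Everything here is a hypothetical stand-in: the slot table replaces the real
hash chains, and request_cleanup_stub() only simulates "we slept and another
thread inserted the record".

```c
#include <stddef.h>

#define NSLOTS 8

/* Hypothetical per-inode dependency records, one slot per inum % NSLOTS. */
struct inodedep {
	unsigned long inum;
	int in_use;
};

static struct inodedep table[NSLOTS];
static int allocations;			/* records created, for illustration */
static unsigned long racing_inum;	/* nonzero: inserted by "another thread" */

static struct inodedep *
inodedep_find(unsigned long inum)
{
	struct inodedep *dep = &table[inum % NSLOTS];

	return (dep->in_use && dep->inum == inum ? dep : NULL);
}

static struct inodedep *
inodedep_insert(unsigned long inum)
{
	struct inodedep *dep = &table[inum % NSLOTS];

	dep->inum = inum;
	dep->in_use = 1;
	allocations++;
	return (dep);
}

/* Returns 1 if the caller may have slept (the table may have changed). */
static int
request_cleanup_stub(void)
{
	if (racing_inum == 0)
		return (0);
	(void)inodedep_insert(racing_inum);	/* simulated concurrent insert */
	racing_inum = 0;
	return (1);
}

/* Look up the record for inum, creating it if it does not exist. */
static struct inodedep *
inodedep_lookup_or_alloc(unsigned long inum)
{
	struct inodedep *dep;

top:
	if ((dep = inodedep_find(inum)) != NULL)
		return (dep);
	/* If we slept, someone else may have created the record: recheck. */
	if (request_cleanup_stub() != 0)
		goto top;
	return (inodedep_insert(inum));
}
```

If the stub returned 0 after sleeping, the function would fall through and
create a second record for an inode that already has one, which is the
inconsistency the commit describes.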

Reviewed by: Peter Wemm <peter@yahoo-inc.com>,
Matt Dillon <dillon@earth.backplane.com>


# 37d40066 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Another round of the <sys/queue.h> FOREACH transmogrifier.

Created with: sed(1)
Reviewed by: md5(1)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# ef9e85ab 03-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Use <sys/queue.h> macro API.


# f8e071a1 29-Jan-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a race between the syncer and umount. When you umount a softupdates
filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependencies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running. It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race. This patch solves the problem.

Approved-by: mckusick


# 1d733bbd 13-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.
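
The co-opting scheme can be sketched as below. This is an illustration under
assumed names (workitem, needs_vnode_lock, process_worklist), not the
function or flag actually added to ffs_softdep.c.

```c
#include <stddef.h>

/* Hypothetical work item; needs_vnode_lock marks requests that lock a vnode. */
struct workitem {
	struct workitem *next;
	int needs_vnode_lock;
	int done;
};

/*
 * Process the list on behalf of the caller.  Only a caller that can safely
 * lock vnodes (the syncer, in the description above) passes a nonzero
 * can_lock_vnodes; co-opted processes pass 0 and skip what they cannot
 * handle.  Returns the number of items processed.
 */
static int
process_worklist(struct workitem *head, int can_lock_vnodes)
{
	struct workitem *wk;
	int processed = 0;

	for (wk = head; wk != NULL; wk = wk->next) {
		if (wk->done)
			continue;
		if (wk->needs_vnode_lock && !can_lock_vnodes)
			continue;	/* leave it for the syncer */
		wk->done = 1;
		processed++;
	}
	return (processed);
}
```

A helper process calls process_worklist(head, 0) and drains the cheap items;
the syncer later calls process_worklist(head, 1) and picks up the remainder.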

Submitted by: Paul Saab <ps@yahoo-inc.com>


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 959b7375 08-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize some malloc M_ instances.


# 71868b02 19-Nov-2000 Kirk McKusick <mckusick@FreeBSD.org>

More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by: Paul Saab <ps@yahoo-inc.com>
Obtained from: Ethan Solomita <ethan@geocast.com>


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
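
A minimal sketch of why the order of widening and inverting matters. The `EX_` macros and both function names are made up for illustration and assume a 4096-byte page; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SIZE	4096			/* assumed page size */
#define EX_PAGE_MASK	(EX_PAGE_SIZE - 1)	/* plain int in this sketch */

/*
 * Widen the mask to 64 bits first, then invert it. The result has all
 * upper bits set no matter whether the mask type is signed or unsigned.
 */
int64_t
trunc_page_widened(int64_t off)
{
	return (off & ~(int64_t)EX_PAGE_MASK);
}

/*
 * Invert a 32-bit unsigned mask and only then widen it. The inverted
 * value 0xFFFFF000 is zero-extended, so the upper 32 bits of the offset
 * are cleared as a side effect. This is the failure mode that signed
 * arithmetic happened to hide.
 */
int64_t
trunc_page_unsigned_bug(int64_t off)
{
	uint32_t inverted = ~(uint32_t)EX_PAGE_MASK;

	return (off & (int64_t)inverted);
}
```

For an offset above 4 GB, the first function rounds down to the page boundary while the second also discards the high half of the offset.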

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often; it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# bd4bd019 14-Nov-2000 Kirk McKusick <mckusick@FreeBSD.org>

When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.
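
The decision described above can be sketched as a pair of predicates. This is a simplified model under stated assumptions: `ex_file_state` and both functions are hypothetical, and the real check inspects dependency structures rather than two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what has reached the disk for a deleted file. */
struct ex_file_state {
	bool dirent_on_disk;	/* directory entry naming the inode was written */
	bool inode_on_disk;	/* allocated (non-zero) inode was written */
};

/*
 * Number of ordered disk writes that must complete before the blocks
 * and inode may go back to the cylinder-group map: one to clear the
 * directory entry if it reached the disk, one to zero the inode if it
 * reached the disk.
 */
int
writes_before_release(const struct ex_file_state *f)
{
	return ((f->dirent_on_disk ? 1 : 0) + (f->inode_on_disk ? 1 : 0));
}

/* Short-lived file: nothing on disk refers to it, so release at once. */
bool
can_release_immediately(const struct ex_file_state *f)
{
	return (writes_before_release(f) == 0);
}
```

The conservative approach in the commit corresponds to returning true only when it is certain that the on-disk inode is still zeroed.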


# 7eb9fca5 09-Oct-2000 Eivind Eklund <eivind@FreeBSD.org>

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# 52a3bfa2 07-Sep-2000 Kirk McKusick <mckusick@FreeBSD.org>

Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK

Obtained from: Ethan Solomita <ethan@geocast.com>


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occasionally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness
they are not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 858c16fa 21-Jun-2000 Kirk McKusick <mckusick@FreeBSD.org>

Update to new copyright.


# 6019e620 18-Jun-2000 Kirk McKusick <mckusick@FreeBSD.org>

When running with quotas enabled on a filesystem using soft updates,
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.

PR: 18959
Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>


# d3abb527 18-Jun-2000 Kirk McKusick <mckusick@FreeBSD.org>

Some additional performance improvements. When freeing an inode
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.


# 75236818 16-Jun-2000 Poul-Henning Kamp <phk@FreeBSD.org>

ARGH! I have too many source trees :-(

Fix prototype errors in last commit.


# a2e7a027 16-Jun-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bde's teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# a64ed089 14-Apr-2000 Robert Watson <rwatson@FreeBSD.org>

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
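
A small sketch of both points: a flag defined as zero can never be tested with a bitwise AND, and a command field with "exactly one bit set" can be checked with a power-of-two test. The `OLD_` and `EX_` constants and their values are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative old-style flags; the values are assumptions. */
#define OLD_B_READ	0x00100000u
#define OLD_B_WRITE	0x00000000u	/* zero: "flags & OLD_B_WRITE" is always 0 */

/* Illustrative command values, each a distinct single bit. */
enum ex_iocmd {
	EX_IO_READ = 0x1,
	EX_IO_WRITE = 0x2,
	EX_IO_FREE = 0x4
};

/* The obvious-looking test that can never succeed with a zero flag. */
bool
old_style_is_write(unsigned int flags)
{
	return ((flags & OLD_B_WRITE) != 0);
}

/* True only when exactly one bit is set in cmd. */
bool
iocmd_is_valid(unsigned int cmd)
{
	return (cmd != 0 && (cmd & (cmd - 1)) == 0);
}
```

With distinct single-bit values, an unset or doubly set command is detectable, which is what makes an enforcement check possible.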

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# 4434ff1d 30-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

When writing out bitmap buffers, need to skip over ones that already
have a write in progress. Otherwise one can get in an infinite loop
trying to get them all flushed.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 57a91f6f 17-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

During fastpath processing for removal of a short-lived inode, the
set of restrictions for cancelling an inode dependency (inodedep)
is somewhat stronger than originally coded. Since this check appears
in two places, we codify it into the function check_inode_unwritten
which we then call from the two sites, one freeing blocks and the
other freeing directory entries.

Submitted by: Steinar Haug via Matthew Dillon


# 4c6adb06 17-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Need to reorganize the flushing of directory entry (pagedep) dependencies
so that they never try to lock an inode corresponding to ".." as this
can lead to deadlock. We observe that any inode with an updated link count
is always pushed into its buffer at the time of the link count change, so
we do not need to do a VOP_UPDATE, but merely find its buffer and write it.
The only time we need to get the inode itself is from the result of a
mkdir whose name will never be ".." and hence locking such an inode will
never request a lock above us in the filesystem tree. Thanks to Brian
Fundakowski Feldman for providing the test program that tickled soft updates
into hanging in "inode" sleep.

Submitted by: Brian Fundakowski Feldman <green@FreeBSD.org>


# 105ef72c 16-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Better bounding on softdep_flushfiles; other minor tweaks to checks.


# 107d5039 16-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Must track multiple uncommitted renames until one ultimately gets
committed to disk or is removed.


# 173cce7c 13-Jan-2000 Matthew Dillon <dillon@FreeBSD.org>

Non-operational change, fix compiler warning.

Reviewed by: mckusick


# d7127837 13-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Confirming Peter's fix (locking 101: release the lock before you go
to sleep). Locking 101, part 2: do not look at buffer contents after
you have been asleep. There is no telling what wondrous changes may
have occurred.


# 7f473504 13-Jan-2000 Peter Wemm <peter@FreeBSD.org>

Free the global softupdates lock prior to tsleep() in getdirtybuf().
This seems to be responsible for a bunch of panics where the process
sleeps and something else finds softupdates "locked" when it shouldn't
be. This commit is unreviewed, but has been a big help here.
Previously my boxes would panic pretty much on the first fsync() that
wrote something to disk.


# 1c2ceb28 13-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Because cylinder group blocks are now written in background,
it is no longer sufficient to get a lock on a buffer to know
that its write has been completed. We have to first get the
lock on the buffer, then check to see if it is doing a
background write. If it is doing background write, we have
to wait for the background write to finish, then check to see
if that fulfilled our dependency, and if not to start another
write. Luckily the explanation is longer than the fix.


# 94313add 13-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

A panic occurs during an fsync when a dirty block associated with
a vnode has not been written (which would clear certain of its
dependencies). The problem arises because fsync with MNT_NOWAIT
no longer pushes all the dirty blocks associated with a vnode. It
skips those that require rollbacks, since they will just get instantly
dirty again. Such skipped blocks are marked so that they will not be
skipped a second time (otherwise circular dependencies would never
clear). So, we fsync twice to ensure that everything will be written
at least once.


# 10767f84 10-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

We cannot proceed to free the blocks of the file until the dependencies
have been cleaned up by deallocate_dependencies(). Once that is done, it
is safe to post the request to free the blocks. A similar change is also
needed for the freefile case.


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 26e5527c 10-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Missing FREE_LOCK call before handle_workitem_freeblocks.

Submitted by: "Kenneth D. Merry" <ken@kdm.org>


# cf60e8e4 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for further allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.
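
Item 2 above (background writes) can be illustrated with a user-space sketch: the bitmap is copied, the copy is what gets written, and the original stays free for further allocations. The `ex_buf` structure and `start_background_write()` are hypothetical and stand in for the kernel's buffer handling, which is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory bitmap buffer. */
struct ex_buf {
	unsigned char *data;
	size_t size;
};

/*
 * Snapshot the bitmap into a separate buffer that would be handed to
 * the disk write. The original is not locked or frozen, so allocations
 * can keep modifying it while the copy is in flight.
 * Returns NULL on allocation failure; the caller frees the copy once
 * the (simulated) write has completed.
 */
unsigned char *
start_background_write(const struct ex_buf *orig)
{
	unsigned char *shadow = malloc(orig->size);

	if (shadow == NULL)
		return (NULL);
	memcpy(shadow, orig->data, orig->size);
	return (shadow);
}
```

Changes made to the original after the copy is taken do not appear in the copy, so they need a later write of their own.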


# f0f7d383 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Keep tighter control of removal dependencies by limiting the number
of dirrem structures rather than the collaterally created freeblks
and freefile structures. Limit the rate of buffer dirtying by the
syncer process during periods of intense file removal.


# 3f5b28bc 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Reorganize softdep_fsync so that it only does the inode-is-flushed
check before the inode is unlocked while grabbing its parent directory.
Once it is unlocked, other operations may slip in that could make
the inode-is-flushed check fail. Allowing other writes to the inode
before returning from fsync does not break the semantics of fsync
since we have flushed everything that was dirty at the time of the
fsync call.


# 83aaf63a 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Make static non-exported functions from soft updates.


# 6a415224 16-Dec-1999 Kirk McKusick <mckusick@FreeBSD.org>

The function request_cleanup() had a tsleep() with PCATCH. It is
quite dangerous, since the process may hold locks at that point,
and if it is stopped in that tsleep the machine may hang. Because
the sleep is so short, the PCATCH is not required here, so it has
been removed. For the future, the FreeBSD team needs to decide
whether it is still reasonable to stop a process in tsleep, as that
may affect any other code that uses PCATCH while holding kernel locks.

Submitted by: Dmitrij Tejblum <tejblum@arc.hq.cti.ru>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 6bdfe06a 11-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() get a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 38224dcd 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 0429e37a 20-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly mp != (void*)&mountlist comparisons.

Requested by: phk
Submitted by: Jake Burkholder jake@checker.org
PR: 14967


# 28065282 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 48703fed 29-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

No longer need to set B_ASYNC flag since BUF_KERNPROC now
unconditionally sets the identity of the buffer.


# a6451da7 27-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Keep the inlines for <sys/buf.h> happy..


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# f9c8cab5 16-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# e4ab40bc 15-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# 2e897e94 21-May-1999 Julian Elischer <julian@FreeBSD.org>

Cosmetic changes to make it compile without errors in gcc -Wall


# c2606ec5 13-May-1999 Kirk McKusick <mckusick@FreeBSD.org>

Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.


# 71a0942a 09-May-1999 Kirk McKusick <mckusick@FreeBSD.org>

Put back changes that might be causing trouble on Alpha.


# 7957996a 06-May-1999 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of random debugging cruft; sync up with latest version.


# 224a6aa2 06-May-1999 Kirk McKusick <mckusick@FreeBSD.org>

Severe slowdowns have been reported when creating or removing many
files at once on a filesystem running soft updates. The root of
the problem is that soft updates limits the amount of memory that
may be allocated to dependency structures so as to avoid hogging
kernel memory. The original algorithm just waited for the disk I/O
to catch up and reduce the number of dependencies. This new code
takes a much more aggressive approach. Basically there are two
resources that routinely hit the limit. Inode dependencies during
periods with a high file creation rate and file and block removal
dependencies during periods with a high file removal rate. I have
attacked these problems from two fronts. When the inode dependency
limits are reached, I pick a random inode dependency, UFS_UPDATE
it together with all the other dirty inodes contained within its
disk block and then write that disk block. This trick usually
clears 5-50 inode dependencies in a single disk I/O. For block and
file removal dependencies, I pick a random directory page that has
at least one remove pending and VOP_FSYNC its directory. That
releases all its removal dependencies to the work queue. To further
hasten things along, I also immediately start the work queue process
rather than waiting for its next one second scheduled run.


# 38e28fd6 01-Mar-1999 Kirk McKusick <mckusick@FreeBSD.org>

Reorganize locking to avoid holding the lock during calls to bdwrite
and brelse (which may sleep in some systems).

Obtained from: Matthew Dillon <dillon@apollo.backplane.com>


# 4cbb89d9 01-Mar-1999 Kirk McKusick <mckusick@FreeBSD.org>

Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries
if they ever arise (which should not happen as softdep_sync_metadata is
currently used).


# 133ff261 17-Feb-1999 Kirk McKusick <mckusick@FreeBSD.org>

fix double LIST_REMOVE; other cosmetic changes to match version 9.32.
Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>


# 8ab2fa00 22-Jan-1999 David Greenman <dg@FreeBSD.org>

Gutted softdep_deallocate_dependencies and replaced it with a panic. It
turns out to not be useful to unwind the dependencies and continue in
the face of a fatal error.
Also changed the log() to a printf() in softdep_error() so that it will
be output in the case of an impending panic.
Submitted by: Kirk McKusick <mckusick@mckusick.com>


# de5d1ba5 07-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Don't pass unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# 4591d9bb 06-Jan-1999 Bruce Evans <bde@FreeBSD.org>

UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed. There was no problem in practice because MNT_WAIT had the
magic value of 1.


# 1f35e8c8 10-Dec-1998 Julian Elischer <julian@FreeBSD.org>

Remove some compiler warnings.


# 2ec07c66 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Change dirty block list handling to use TAILQ macros.


# 2dcc2f06 28-Oct-1998 Jordan K. Hubbard <jkh@FreeBSD.org>

Clarify a rather ambiguous debugging message.


# ed8d80c2 03-Oct-1998 Nate Williams <nate@FreeBSD.org>

Fix 'noatime' bug that was unrelated to use of noatime.

The problem is caused when a directory block is compacted. When this
occurs, softdep_change_directoryentry_offset() is called to relocate each
directory entry and adjust its matching diradd structure, if any, to match
the new location of the entry. The bug is that while
softdep_change_directoryentry_offset() correctly adjusts the offsets of
the diradd structures on the pd_diraddhd[] lists (which are not yet ready
to be committed to disk), it fails to adjust the offsets of the diradd
structures on the pd_pendinghd list (which are ready to be committed to
disk). This causes the dependency structures to be inconsistent with
the buf contents. Now, if the compaction has moved a directory entry to
the same offset as one of the diradd structures on the pd_pendinghd list
*and* a syscall is done that tries to remove this directory entry before
this directory block has been written to disk (which would empty
pd_pendinghd), a sanity check in newdirrem() will call panic() when it
notices that the inode number in the entry that it is to be removed doesn't
match the inode number in the diradd structure with that offset of that
entry.

Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>


# e266594c 24-Sep-1998 Luoqi Chen <luoqi@FreeBSD.org>

Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitous checks for error return from vn_lock with LK_RETRY set,
vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>


# 55d80b2d 12-Aug-1998 Julian Elischer <julian@FreeBSD.org>

Handle the case of moving a directory onto the top of a sibling's
child of the same name.

Submitted by: Kirk McKusick with fixes from Luoqi Chen
Obtained from: Whistle test tree.


# 28ed0326 12-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Note which version of Kirk's sources this corresponds to.


# aa75cb86 12-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Fix the case when renaming to a file that you've just created and deleted,
that had an inode that has not yet been written to disk, when the inode of the
new file is also not yet written to disk, and your old directory entry is not
yet on disk but you need to remove it and the new name exists in memory
but has been deleted but the transaction to write the deleted name to disk
exists and has not yet been cancelled by the request to delete the non
existent name. I don't know how Kirk could have missed such a glaring
problem for so long. :-) Especially since the inconsistency survived on
the disk for a whole 4 seconds on average before being fixed by other code.
This was not a crashing bug but just led to filesystem inconsistencies
if you crashed.

Submitted by: Kirk McKusick (mckusick@mckusick.com)


# 6d0ba442 11-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL.
Change worked out by john and kirk in consort.


# 8c221701 10-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Fix for "live inode" panic.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Reviewed by: yeah right...


# 4af0bb0f 10-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Remove buggy debugging code.


# b8cf4de4 26-May-1998 Julian Elischer <julian@FreeBSD.org>

A fix to a debug test from Kirk.


# 25db4e8a 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Bring up-to-date with Whistle's current version
Includes some debugging code.


# 46e752be 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Merge with Kirk's version as of Feb 20

His version 9.23 == our version 1.5 of ffs_softdep.c
His version 9.5 == our version 1.4 of softdep.c


# 62e12c76 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Merge in Kirk's changes to stop softupdates from hogging all of memory.


# b6dad363 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Change to stop a silly panic. This should be understood better.
Change a buffer swizzle trick to a bcopy. It would be nice if the efficient
trick could be used in the future.


# 987614a9 19-May-1998 Julian Elischer <julian@FreeBSD.org>

First published FreeBSD version of soft updates Feb 5.


# 8e95b94d 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Import the next version received from kirk after some
FreeBSD feedback.


# 467e1a6e 19-May-1998 Julian Elischer <julian@FreeBSD.org>

Import the earliest version of the soft update code that I have.