History log of /freebsd-current/sys/ufs/ffs/ffs_inode.c
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# c656f5c1 27-Oct-2023 Navdeep Parhar <np@FreeBSD.org>

Fix build with gcc12.


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 831b1ff7 27-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

UFS/FFS: Migrate to modern uintXX_t from u_intXX_t.

As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.

No functional changes intended.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 2e66649e 13-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Another fix to build from 064e6b4.

Spotted by: Cy Schubert


# 064e6b43 13-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Rewrite function definitions in the UFS/FFS code base with identifier lists.

The K&R style in UFS and other places in the tree's days are numbered
as this syntax is removed in C2x proposal N2432:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf

Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.

Requested by: Warner Losh
MFC after: 2 weeks


# 8d8589b3 17-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

ufs: be more persistent with finishing some operations

when the vnode is doomed after relock. The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it is
still belongs to UFS, which is determined by non-NULL v_data. Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.

Add macro IS_UFS() which incapsulates the v_data != NULL, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.

Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# 5b13fa79 02-Jan-2022 Jessica Clarke <jrtc27@FreeBSD.org>

ufs: Rework shortlink handling to avoid subobject overflows

Shortlinks occupy the space of both di_db and di_ib when used. However,
everywhere that wants to read or write a shortlink takes a pointer do
di_db and promptly runs off the end of it into di_ib. This is fine on
most architectures, if a little dodgy. However, on CHERI, the compiler
can optionally restrict the bounds on pointers to subobjects to just
that subobject, in order to mitigate intra-object buffer overflows, and
this is enabled in CheriBSD's pure-capability kernels.

Instead, clean this up by inserting a union such that a new di_shortlink
can be added with the right size and element type, avoiding the need to
cast and allowing the use of the DIP macro to access the field. This
also mirrors how the ext2fs code implements extents support, with the
exact same structure other than having a uint32_t i_data[] instead of a
char di_shortlink[].

Reviewed by: mckusick, jhb
Differential Revision: https://reviews.freebsd.org/D33650


# 2030ee0e 19-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

ufs: remove write-only variables

Mark variables as __diagused for invariant-only vars

Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577


# 9acea164 02-Oct-2021 Robert Wing <rew@FreeBSD.org>

ffs: retire unused fsckpid mount option

The fsckpid mount option was introduced in 927a12ae16433b50 along with a
couple sysctl's to support SU+J with snapshots. However, those sysctl's
were never used and eventually removed in f2620e9ceb3ede02.

There are no in-tree consumers of this mount option.

Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D32015


# bb536de6 26-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_update(): Do not assume that EBUSY can only come LK_NOWAIT trylock

Instead do protective check for the local flags and do not interpret
EBUSY specially if we did not request trylock mode for bread().

Reviewed by: mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f822d4fe 26-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs_update(): recalculate flags after relocking the vnode

Inode type could migrate between snapshot and regular types while the
vnode is unlocked. Recalculate flags specific for snapshot after relock.

Reviewed by: mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f784da88 17-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Move mnt_maxsymlinklen into appropriate fs mount data structures

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-Note: struct mount layout
Differential revision: https://reviews.freebsd.org/D30325


# 9a2fac6b 16-May-2021 Kirk McKusick <mckusick@FreeBSD.org>

Fix handling of embedded symbolic links (and history lesson).

The original filesystem release (4.2BSD) had no embedded sysmlinks.
Historically symbolic links were just a different type of file, so
the content of the symbolic link was contained in a single disk block
fragment. We observed that most symbolic links were short enough that
they could fit in the area of the inode that normally holds the block
pointers. So we created embedded symlinks where the content of the
link was held in the inode's pointer area thus avoiding the need to
seek and read a data fragment and reducing the pressure on the block
cache. At the time we had only UFS1 with 32-bit block pointers,
so the test for a fastlink was:

di_size < (NDADDR + NIADDR) * sizeof(daddr_t)

(where daddr_t would be ufs1_daddr_t today).

When embedded symlinks were added, a spare field in the superblock
with a known zero value became fs_maxsymlinklen. New filesystems
set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded
symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus
filesystems that preceeded this change always read from blocks
(since fs->fs_maxsymlinklen == 0) and newer ones used embedded
symlinks if they fit. Similarly symlinks created on pre-embedded
symlink filesystems always spill into blocks while newer ones will
embed if they fit.

At the same time that the embedded symbolic links were added, the
on-disk directory structure was changed splitting the former
u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen.
Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can
be used to distinguish old directory formats. In retrospect that
should have just been an added flag, but we did not realize we
needed to know about that change until it was already in production.

Code was split into ufs/ffs so that the log structured filesystem could
use ufs functionality while doing its own disk layout. This meant
that no ffs superblock fields could be used in the ufs code. Thus
ffs superblock fields that were needed in ufs code had to be copied
to fields in the mount structure. Since ufs_readlink needed to know
if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen.

The kernel panic that arose to making this fix was triggered when a
disk error created an inode of type symlink with no allocated data
blocks but a large size. When readlink was called the uiomove was
attempted which segment faulted.

static int
ufs_readlink(ap)
struct vop_readlink_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
} */ *ap;
{
struct vnode *vp = ap->a_vp;
struct inode *ip = VTOI(vp);
doff_t isize;

isize = ip->i_size;
if ((isize < vp->v_mount->mnt_maxsymlinklen) ||
DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
return (uiomove(SHORTLINK(ip), isize, ap->a_uio));
}
return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred));
}

The second part of the "if" statement that adds

DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */

is problematic. It never appeared in BSD released by Berkeley because
as noted above mnt_maxsymlinklen is 0 for old format filesystems, so
will always fall through to the VOP_READ as it should. I had to dig
back through `git blame' to find that Rodney Grimes added it as
part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.''
He must have brought it across from an earlier FreeBSD. Unfortunately
the source-control logs for FreeBSD up to the merger with the
AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the
agreement to let FreeBSD remain unencumbered, so I cannot pin-point
where that line got added on the FreeBSD side.

The one change needed here is that mnt_maxsymlinklen is declared as
an `int' and should be changed to be `u_int64_t'.

This discovery led us to check out the code that deletes symbolic
links. Specifically

if (vp->v_type == VLNK &&
(ip->i_size < vp->v_mount->mnt_maxsymlinklen ||
datablocks == 0)) {
if (length != 0)
panic("ffs_truncate: partial truncate of symlink");
bzero(SHORTLINK(ip), (u_int)ip->i_size);
ip->i_size = 0;
DIP_SET(ip, i_size, 0);
UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE);
if (needextclean)
goto extclean;
return (ffs_update(vp, waitforupdate));
}

Here too our broken symlink inode with no data blocks allocated
and a large size will segment fault as we are incorrectly using the
test that we have no data blocks to decide that it is an embdedded
symbolic link and attempting to bzero past the end of the inode.
The test for datablocks == 0 is unnecessary as the test for
ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right
thing in all cases.

The test for datablocks == 0 was added by David Greenman in this commit:

Author: David Greenman <dg@FreeBSD.org>
Date: Tue Aug 2 13:51:05 1994 +0000

Completed (hopefully) the kernel support for old style "fastlinks".

Notes:
svn path=/head/; revision=1821

I am guessing that he likely earlier added the incorrect test in the
ufs_readlink code.

I asked David if he had any recollection of why he made this change.
Amazingly, he still had a recollection of why he had made a one-line
change more than twenty years ago. And unsurpisingly it was because
he had been stuck between a rock and a hard place.

FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code
base. Prior to that, there were three years of development in all
areas of the kernel, including the filesystem code, from the combined
set of people including Bill Jolitz, Patchkit contributors, and
FreeBSD Project members. The compatibility issue at hand was caused
by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite
changes David had to find a way to provide compatibility with both
the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite.
He felt that these changes would provide compatibility with both systems.

In his words:
``My recollection is that the 'FASTLINKS' symlinks support in
FreeBSD-1.x, as implemented by Curt Mayer, worked differently than
4.4BSD. He used a spare field in the inode to duplicately store the
length. When the 4.4BSD-Lite merge was done, the optimized symlinks
support for existing filesystems (those that were initialized in
FreeBSD-1.x) were broken due to the FFS on-disk structure of
4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to
restore the backward compatibility with FreeBSD-1.x filesystems.
I think it was the best that could be done in the somewhat urgent
circumstances of the post Berkeley-USL settlement. Also, regarding
Rod's massive commit with little explanation, some context: John
Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to
the 386 platform in just 10 days. It was by far the most intense
hacking effort of my life. In addition to the porting of tons of
FreeBSD-1 code, I think we wrote more than 30,000 lines of new code
in that time to deal with the missing pieces and architectural
changes of 4.4BSD-Lite. We didn't make many notes along the way.
There was a lot of pressure to get something out to the rest of the
developer community as fast as possible, so detailed discrete commits
didn't happen - it all came as a giant wad, which is why Rod's
commit message was worded the way it was.''

Reported by: Chuck Silvers
Tested by: Chuck Silvers
History by: David Greenman Lawrence
MFC after: 1 week
Sponsored by: Netflix


# 2bfd8992 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vnode: move write cluster support data to inodes.

The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679


# e94f2f1b 28-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

ffs: call ufsdirhash_dirtrunc() right after setting directory size

Later processing of ffs_truncate() might temporary unlock the directory
vnode, causing unsychronized dirhash and inode sizes if update is
postponed to UFS_TRUNCATE() callers.

Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 8a1509e4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Handle LoR in flush_pagedep_deps().

When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock. Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed. This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps(). The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed. All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP. For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied. Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes. [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged. A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive(). In this case, dropping the
vnode lock is not safe. Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT. ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 738ea001 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Add ffs_inode_bwrite() helper.

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 7b795aa3 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Revert r367669 to re-commit with proper message


# c0d2077f 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a framework that tracks exclusive vnode lock generation count for UFS.

This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored. Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents. If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter. Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock. It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# d90f2c36 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: clean up empty lines in .c and .h files


# 513274c7 06-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Clear the IN_SIZEMOD and IN_IBLKDATA flags only when doing a
synchronous inode update.

The IN_SIZEMOD and IN_IBLKDATA flags indicate changes to the
file size and block pointer fields in the inode. When these
fields have been changed, the fsync() and fsyncdata() system
calls must write the inode to ensure their semantics that the
file is on stable store.

The IN_SIZEMOD and IN_IBLKDATA flags cannot be cleared until
a synchronous write of the inode is done. If they are cleared
on an asynchronous write, then the inode may not yet have been
written to the disk when an fsync() or fsyncdata() call is done.
Absent these flags, these calls would not know that they needed
to write the inode. Thus, these flags only can be cleared on
synchronous writes of the inode. Since the inode will be locked
for the duration of the I/O that writes it to disk, no fsync()
or fsyncdata() will be able to run before the on-disk inode
is complete.

Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072


# 52488b51 04-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Further evaluation of the POSIX spec for fdatasync() shows that it
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.

This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.

Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072


# 7428630b 03-Jun-2020 Konstantin Belousov <kib@FreeBSD.org>

UFS: write inode block for fdatasync(2) if pointers in inode where allocated

The fdatasync() description in POSIX specifies that
all I/O operations shall be completed as defined for synchronized I/O
data integrity completion.
and then the explanation of Synchronized I/O Data Integrity Completion says
The write is complete only when the data specified in the write
request is successfully transferred and all file system
information required to retrieve the data is successfully
transferred.

For UFS this means that all pointers must be on disk. Indirect
pointers already contribute to the list of dirty data blocks, so only
direct blocks and root pointers to indirect blocks, both of which
reside in the inode block, should be taken care of. In ffs_balloc(),
mark the inode with the new flag IN_IBLKDATA that specifies that
ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the
inode block.

Reviewed by: mckusick
Discussed with: tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25072


# d79ff54b 25-May-2020 Chuck Silvers <chs@FreeBSD.org>

This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.

- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088


# 621a2748 09-Apr-2020 Kirk McKusick <mckusick@FreeBSD.org>

Fixing the soft update macros in -r359612 triggered a previously
hidden bug in the file truncation code. Until that bug is tracked
down and fixed, revert to the old behavior.

Reported by: Peter Holm
Reviewed by: kib, Chuck Silvers


# c79f5a43 06-Apr-2020 Kirk McKusick <mckusick@FreeBSD.org>

Revert -r359612 as it can cause other panics.
An updated version will be made when the issue has been resolved.

Reported by: Peter Holm


# 2baca885 03-Apr-2020 Kirk McKusick <mckusick@FreeBSD.org>

When shrinking the size of a directory it is sometimes necessary to
sync it to disk before shrinking it. Complete the sync before getting
the buffer for the block to be updated to do the shrink to avoid
panicing with a recursive lock on one of the directory's buffers.

Reviewed by: Chuck Silvers (chs)
MFC after: 3 days
Sponsored by: Netflix


# ac4ec141 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

ufs: add a setter for inode i_flag field

This will be used later to add vnodes to the lazy list.

Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# d00066a5 03-Dec-2019 Kirk McKusick <mckusick@FreeBSD.org>

Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 90381b1c 31-Jul-2019 Kirk McKusick <mckusick@FreeBSD.org>

When updating the user or group disk quotas for the return of inodes or
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential .

Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC: 1 week
Sponsored by: Netflix


# 65417f5e 24-May-2019 Alan Somers <asomers@FreeBSD.org>

Remove "struct ucred*" argument from vtruncbuf

vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377


# 35327182 11-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

Give more complete information in INVARIANTS panic messages at end of
the ffs_truncate() function.

Sponsored by: Netflix


# 8f829a5c 11-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 9fc5d538 13-Nov-2018 Kirk McKusick <mckusick@FreeBSD.org>

In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 19fa89e9 25-Aug-2018 Mark Murray <markm@FreeBSD.org>

Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898


# 7e038bc2 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix


# cc91864c 18-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.


# 68c49bcc 06-Aug-2018 Kirk McKusick <mckusick@FreeBSD.org>

Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 75e3597a 21-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.

sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)

sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)

sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 1dc349ab 15-Feb-2017 Ed Maste <emaste@FreeBSD.org>

prefix UFS symbols with UFS_ to reduce namespace pollution

Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by: kib, mckusick
Obtained from: NetBSD
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9536


# e1db6897 17-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Reduce size of ufs inode.

Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2. Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 0c01bcb9 08-Sep-2016 Bruce Evans <bde@FreeBSD.org>

Sprinkle DOINGASYNC() checks so as to do delayed writes for async
mounts in almost all cases instead of in most cases. Don't override
DOINGASYNC() by any condition except IO_SYNC.

Fix previous sprinking of DOINGASYNC() checks. Don't override IO_SYNC
by DOINGASYNC(). In ffs_write() and ffs_extwrite(), there were
intentional overrides that just broke O_SYNC of data. In
ffs_truncate(), there are 5 calls to ffs_update(), 4 with
apparently-unintentional overrides and 1 without; this had no effect
due to the main async mount hack descibed below.

Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden
for the soft updates case too (to do a delayed write instead of a sync
write). This is supposed to be the only change that affects anything
except async mounts.

In ffs_update(), remove the 19 year old efficiency hack of ignoring
the waitfor flag for async mounts, so that fsync() almost works for
async mounts. All callers are supposed to be fixed to not ask for a
sync update unless they are for fsync() or [I]O_SYNC operations.
fsync() now almost works for async mounts. It used to sync the data
but not the most important metdata (the inode). It still doesn't sync
associated directories.

This gave 10-20% fewer writes for my makeworld benchmark with async
mounted tmp and obj directories from an already small number.

Style fixes:
- in ffs_balloc.c, remove rotted quadruplicated comments about the
simplest part of the DOING*() decisions and rearrange the nearly-
quadruplicated code to be more nearly so.
- in ufs_vnops.c, use a consistent style with less negative logic and
no manual "optimization" of || to | in DOING*() expressions.

Reviewed by: kib (previous version)


# f30ddc49 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by: bde
MFC after: 1 week


# ae34b6ff 06-Apr-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080


# 24b6748c 23-Feb-2016 Kirk McKusick <mckusick@FreeBSD.org>

The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by: Peter Holm


# d2a28cb0 27-Jan-2016 Kirk McKusick <mckusick@FreeBSD.org>

The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib


# d1b06863 30-Jun-2015 Mark Murray <markm@FreeBSD.org>

Huge cleanup of random(4) code.

* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# fe85d98a 03-Feb-2013 Kirk McKusick <mckusick@FreeBSD.org>

For UFS2 i_blocks is unsigned. The current "sanity" check that it
has gone below zero after the blocks in its inode are freed is a
no-op which the compiler fails to warn about because of the use of
the DIP macro. Change the sanity check to compare the number of
blocks being freed against the value i_blocks. If the number of
blocks being freed exceeds i_blocks, just set i_blocks to zero.

Reported by: Pedro Giffuni (pfg@)
MFC after: 2 weeks


# c52fd858 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 6c09f4a2 28-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.

This change will be included along with 232351 when it is MFC'ed to 9.

Spotted by: kib
Reviewed by: kib


# 75a58389 24-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


# 92ccae03 11-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove superfluous brackets.

Submitted by: alc
MFC after: 2 weeks


# dd522d76 11-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Do schedule delayed writes for async mounts.
While there, make some style adjustments, like missed () around
return values.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 2 weeks


# 2fd2c0b1 11-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not fall back to slow synchronous i/o when low on memory or buffers.
The bawrite() schedules the write to happen immediately, and its use
frees the current thread to do more cleanups.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 2 weeks


# 35338e60 01-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


# 82378711 25-Aug-2011 Martin Matuska <mm@FreeBSD.org>

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# 927a12ae 15-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.


# 6bbee8e2 29-Jun-2011 Alan Cox <alc@FreeBSD.org>

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 43a3cc77 15-Jun-2011 Kirk McKusick <mckusick@FreeBSD.org>

Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by: Jeff Roberson


# 280e091a 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# fae5c47d 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by: jeff
Tested by: pho


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 9f9c8c59 06-Jul-2010 Jeff Roberson <jeff@FreeBSD.org>

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal files. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# ec7e66e8 27-Jan-2009 Robert Watson <rwatson@FreeBSD.org>

Following a fair amount of real world experience with ACLs and
extended attributes since FreeBSD 5, make the following semantic
changes:

- Don't update the inode modification time (mtime) when extended
attributes (and hence also ACLs) are added, modified, or removed.
- Don't update the inode access tie (atime) when extended attributes
(and hence also ACLs) are queried.

This means that rsync (and related tools) won't improperly think
that the data in the file has changed when only the ACL has changed.

Note that ffs_reallocblks() has not been changed to not update on an
IO_EXT transaction, but currently EAs don't use the cluster write
routines so this shouldn't be a problem. If EAs grow support for
clustering, then VOP_REALLOCBLKS() will need to grow a flag argument
to carry down IO_EXT to UFS.

MFC after: 1 week
PR: ports/125739
Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: pluknet <pluknet@gmail.com>,
Greg Byshenk <freebsd@byshenk.net>
Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>


# b51b07be 20-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

The r187467 should remove all pages for V_NORMAL case too, because
indirect block pages are not removed by the mentioned invocation of
the vnode_pager_setsize().

Put a common code into the helper function ffs_pages_remove().

Reported and tested by: dchagin
Reviewed by: ups
MFC after: 3 weeks


# b1a4c8e5 20-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

When extending inode size, we call vnode_pager_setsize(), to have a
address space where to put vnode pages, and then call UFS_BALLOC(),
to actually allocate new block and map it. When UFS_BALLOC() returns
error, sometimes we forget to revert the vm object size increase,
allowing for the pages that are not backed by the logical disk blocks.

Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for
ffs_truncate() and ffs_write().

PR: 129956
Reviewed by: ups
MFC after: 3 weeks


# 9316467d 20-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 90446e36 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce UFS_RDONLY() struct ufsmount' method that reports the value
of the fs_ronly. The later is set to 1 only after the remount is
finished.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 698b1a66 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 1102b89b 08-Nov-2007 David E. O'Brien <obrien@FreeBSD.org>

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# ec7a247a 10-Oct-2006 Konstantin Belousov <kib@FreeBSD.org>

Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.

Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.

Tested by: Peter Holm
Reviewed by: tegge, bde
Approved by: pjd (mentor)
MFC after: 3 weeks


# ece9473e 05-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Consistently call 'vp' vp rather than ovp sometimes in ffs_truncate().
Do the same for oip.

Pointed out by: glebius


# b5411d4f 12-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Fix an assert now that the XLOCK no longer exists.

Sponsored by: Isilon Systems, Inc.


# 1a4a9672 22-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add VOP locking asserts in several functions that have been implicated in
recent deadlocks.


# a3caf16e 09-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- In the softupdates case for ffs_truncate() we use vinvalbuf() to
invalidate pending io and dependencies. However, vinvalbuf() rightfully
does not call vnode_pager_setsize() for us. We must do this here. This
could potentially have caused numerous kinds of bugs, but it was
specifically causing msync() deadlocks because msync() was writing
flushing pages that should not have been valid.

Sponsored by: Isilon Systems, Inc.
Reported by: kkenn


# efd6d980 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use the UFS_* and VFS_* functions where a direct call is possble.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 40854ff5 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 5c77b03e 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Acquire the ufs lock when manipulating some fields of struct fs.
- Change arguments to various ffs functions to match their new
prototypes.

Sponsored By: Isilon Systems, Inc.


# 7c0745ee 14-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 156cb265 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# b792bebe 24-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# b403319b 28-Jul-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# 012d4134 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


# 2c18019f 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


# 4e1694ec 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY()


# cffa37d4 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- In ffs_update() assert that either the vnode lock or the XLOCK is held.


# f4636c59 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 7261f5f6 03-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 86270230 02-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


# 2ee5711e 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


# 98caa2e4 05-Aug-2002 Ian Dowse <iedowse@FreeBSD.org>

Don't call softdep_slowdown() if soft updates are not active on the
filesystem. This causes a panic for kernels compiled without
softupdates.

Reported by: luigi


# 7aca6291 19-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 10cfbc19 23-Jun-2002 Matthew Dillon <dillon@FreeBSD.org>

Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by: mckusick


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 05f4ff5d 13-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove register keyword.

Sponsored by: DARPA & NAI Labs.
Submitted by: mckusick


# 6f1e8551 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# c9f96392 01-Feb-2002 Kirk McKusick <mckusick@FreeBSD.org>

When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.


# 8af31e7b 15-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

Put write on read-only filesystem panic after we have weeded out
block and character devices, fifo's, etc.

Submitted by: Bruce Evans <bde@zeta.org.au>


# cd600596 15-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>


# 9db12e51 12-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

When a file is partially truncated, we first check to see if the
new file end will land in the middle of a file hole. Since the last
block of a file must always be allocated, the hole is filled by
allocating a block at that location. If the hole being filled is
a direct block, then the truncation may eventually reduce the
full sized block down to a fragment. When running with soft
updates, it is necessary to FSYNC the file after allocating the
block and before creating the fragment to avoid triggering a
soft updates inconsistency when the block unexpectedly shrinks.

Found by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 9ccb939e 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.


# 855aa097 28-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 6da443cb 18-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of spurious check in ffs_truncate for i_size == length
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


# 1d733bbd 13-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by: Paul Saab <ps@yahoo-inc.com>


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 3592b715 26-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 87150cb0 29-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

s/biowait/bufwait/g

Prodded by: several.


# eb95c536 29-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded #include <sys/kernel.h>


# a64ed089 14-Apr-2000 Robert Watson <rwatson@FreeBSD.org>

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 4dc0c8f5 13-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has
been set for a mount point. Insert missing checks to ensure that all
write operations are done asynchronously when the MNT_ASYNC option
has been requested.

Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# de5d1ba5 07-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# 4591d9bb 06-Jan-1999 Bruce Evans <bde@FreeBSD.org>

UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed. There was no problem in practice because MNT_WAIT had the
magic value of 1.


# 5991fd03 06-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Backed out rev.1.47. It just broke my optimisations for lazy syncing
of timestamps in rev.1.45. The soft updates bug was elsewhere.

Forgotten by: luoqi


# 40c8cfe5 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# f9e84c2f 15-Sep-1998 Luoqi Chen <luoqi@FreeBSD.org>

Restore pre-v1.44 behavior: always copy modified in-core inode to disk
buffer. Otherwise some in-core inode changes might be lost, including
important meta data (e.g. size) if softupdates is enabled.


# fd5d1124 04-Jul-1998 Julian Elischer <julian@FreeBSD.org>

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


# 30551872 03-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Sync timestamp changes for inodes of special files to disk as late
as possible (when the inode is reclaimed). Temporarily only do
this if option UFS_LAZYMOD configured and softupdates aren't enabled.
UFS_LAZYMOD is intentionally left out of /sys/conf/options.

This is mainly to avoid almost useless disk i/o on battery powered
machines. It's silly to write to disk (on the next sync or when the
inode becomes inactive) just because someone hit a key or something
wrote to the screen or /dev/null.

PR: 5577
Previous version reviewed by: phk


# 33cc029e 03-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Centralized in-core inode update. Update the in-core inode directly
in ufs_setattr() so that there is no need to pass timestamps to
UFS_UPDATE() (everything else just needs the current time). Ignore
the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes()
(was: itimes()) to do the update. The timestamps are still passed
so that all the callers don't need to be changed yet.


# c619155f 14-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Slight change to directory cleanup
Makes soft updates a bit cleaner. Eliminates some warnings about
'corrupted directories' from fsck.


# e60606c0 04-May-1998 John Dyson <dyson@FreeBSD.org>

Correct an error that I made where the vtruncbuf was changed back
to vinvalbuf, but I incorrectly added the "V_SAVE|V_SAVEMETA" flags.
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


# 83ad4e3d 29-Apr-1998 John Dyson <dyson@FreeBSD.org>

Fix an error that I made with an optimization. In the case
of softupdates, we need to do vtruncbuf the old way. Luoqi
caught, found the bug and submitted this fix.
Submitted by: Luoqi Chen <luoqi@chen.ml.org>


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# a0502b19 26-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Add two new functions, get{micro|nano}time.

They are atomic, but return in essence what is in the "time" variable.
gettime() is now a macro front for getmicrotime().

Various patches to use the two new functions instead of the various
hacks used in their absence.

Some puntuation and grammer patches from Bruce.

A couple of XXX comments.


# 34f72be5 19-Mar-1998 John Dyson <dyson@FreeBSD.org>

Fix vfs_bio_awrite usage, and correct vtruncbuf usage.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 9cfcd011 01-Feb-1998 John Dyson <dyson@FreeBSD.org>

Back out recent laptop sync changes. They had significant errors.


# de1050d8 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Support more intelligent sync operations for MNT_NOATIME.
PR: kern/5577
Submitted by: Craig Leres <leres@ee.lbl.gov>


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 987f5696 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


# cec0f20c 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 3c816944 21-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Fixed some invalid (non-atomic) accesses to `time', mostly ones of the
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# e75838f7 05-Nov-1996 David Greenman <dg@FreeBSD.org>

Eliminate an unnecessary synchronous write (and an 8K bcopy+bzero) when
truncating/deleting large files.

Reviewed by: mckusick, dyson
Submitted by: Kirk McKusick <mckusick@mckusick.com>, modified for
FreeBSD by me.


# 030e2e9e 19-Sep-1996 Nate Williams <nate@FreeBSD.org>

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>


# e1eec28a 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 01733a9b 05-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert QUOTA to new-style option.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 4dc4e0d2 05-Nov-1995 John Dyson <dyson@FreeBSD.org>

Make MNT_ASYNC more effective for UFS. It should not be too much more
dangerous than the original MNT_ASYNC. There might be some minor
security considerations due to data writes not being posted as promptly
as before. Meta-data operations are still not quite as fast as Linux,
but streaming I/O is still higher.


# a9e479e2 16-Aug-1995 David Greenman <dg@FreeBSD.org>

Honor -async mount option when doing the inode update.

Obtained from: 4.4BSD-Lite2


# 7221ecc0 03-Aug-1995 David Greenman <dg@FreeBSD.org>

Use the correct flags (IO_SYNC -> B_SYNC) when deciding to do a sync or
async write in the section that changes the filesize. The bug resulted
in the updates always being async.

Obtained from: 4.4BSD-Lite2


# f57459b6 26-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed third arg (vmio) to allocbuf() that was added with the original
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.

Submitted by: John Dyson


# 403ef252 03-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed obsolete vtrace() remnants.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 59f08b8b 27-Dec-1994 Bruce Evans <bde@FreeBSD.org>

Use the same current time throughout ffs_update().

Update some macro names in comments.

Don't use MNT_WAIT for something not related to mounting.


# 901ba606 21-Oct-1994 David Greenman <dg@FreeBSD.org>

Restrict fs_maxfilesize to 2^40, and check against this in ffs_truncate().
This is part of a bug fix from Kirk McKusick to work around problems in FFS
related to the blkno of a 64bit offset not fitting into an int. Note the
proper solution would be to deal with 64bit block numbers, but doing this
would require sweeping changes; some other day perhaps.

Submitted by: Marshall Kirk McKusick


# c1d9efcb 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. make gcc less noisy. Still some way to go here.


# 01825ed3 02-Sep-1994 David Greenman <dg@FreeBSD.org>

panic if length is < 0 in ffs_truncate().


# 1cdeb653 29-Aug-1994 David Greenman <dg@FreeBSD.org>

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


# d319b932 03-Aug-1994 David Greenman <dg@FreeBSD.org>

Changed occurrances of "itrunc" to "ffs_truncate" to make Bruce happy.


# 09ce307f 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Completed (hopefully) the kernel support for old style "fastlinks".


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources