History log of /freebsd-current/sys/ufs/ffs/fs.h
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 831b1ff7 27-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

UFS/FFS: Migrate to modern uintXX_t from u_intXX_t.

As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.

No functional changes intended.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# 6f0ca273 26-Jul-2023 Kirk McKusick <mckusick@FreeBSD.org>

Add diagnostics to fsck_ffs(8) for journaled soft-updates debugging.

MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation


# e4a905d1 25-May-2023 Kirk McKusick <mckusick@FreeBSD.org>

Add the ability to adjust directory depths to background fsck_ffs(8).

Commit fe5e6e2 improved FFS directory placement when creating new
directories. It is done by keeping track of the depth of directories
in the filesystem and placing those lower in the tree closer together
while spreading out those higher in the tree.

Fsck_ffs(8) checks these depths and if incorrect adjusts them to
their correct value. When running in background fsck_ffs(8) needs
to be able to make an adjustment to the depth. This commit adds
the sysctl to make such an adjustment and adds the code to fsck_ffs(8)
to use the new sysctl.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 0a6e34e9 15-May-2023 Kirk McKusick <mckusick@FreeBSD.org>

Fix size differences between architectures of the UFS/FFS CGSIZE macro value.

The cylinder group header structure ended with `u_int8_t cg_space[1]'
representing the beginning of the inode bitmap array. Some architectures
like the i386 rounded this up to a 4-byte boundry while other
architectures like the amd64 rounded it up to an 8-byte boundry.
Thus sizeof(struct cg) was four bytes bigger on an amd64 machine
than on an i386 machine. If a filesystem created on an i386 machine
was moved to an amd64 machine, the size of the cylinder group
calculated by the CGSIZE macro would appear to grow by four bytes.
Filesystems whose cylinder groups were exactly equal to the block
size on an i386 machine would appear to have a cylinder group that
was four bytes too big when moved to an amd64 machine. Note that
although the structure appears to be too big, it in fact is fine.
It is just the calaculation of its size that is in error.

The fix is to remove the cg_space element from the cylinder-group
structure so that the calculated size of the structure is the same
size on all architectures.

Reported by: Tijl Coosemans
Tested by: Tijl Coosemans and Peter Holm
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 243a0eda 21-Oct-2022 Kirk McKusick <mckusick@FreeBSD.org>

Increase the maximum size of the journaled soft-updates journal.

The size of the journaled soft-updates journal should be big enough
to hold two minutes of filesystem metadata-update activity. The
maximum size of the soft updates journal was set in the 1990s. At
the time it was assummed that disk arrays would top out at 16 drives
and disk writes per drive would top out at 500 per second. Today's
I/O subsystems are considerably bigger and faster than those limits.
Thus this delta removes the hard upper limit and lets tunefs(8) and
newfs(8) set the upper bound based on the size of the filesystem and
its cylinder groups.

Sponsored by: The FreeBSD Foundation


# e6886616 13-Aug-2022 Kirk McKusick <mckusick@FreeBSD.org>

Move the ability to search for alternate UFS superblocks from fsck_ffs(8)
into ffs_sbsearch() to allow use by other parts of the system.

Historically only fsck_ffs(8), the UFS filesystem checker, had code
to track down and use alternate UFS superblocks. Since fsdb(8) used
much of the fsck_ffs(8) implementation it had some ability to track
down alternate superblocks.

This change extracts the code to track down alternate superblocks
from fsck_ffs(8) and puts it into a new function ffs_sbsearch() in
sys/ufs/ffs/ffs_subr.c. Like ffs_sbget() and ffs_sbput() also found
in ffs_subr.c, these functions can be used directly by the kernel
subsystems. Additionally they are exported to the UFS library,
libufs(8) so that they can be used by user-level programs. The new
functions added to libufs(8) are sbfind(3) that is an alternative
to sbread(3) and sbsearch(3) that is an alternative to sbget(3).
See their manual pages for further details.

The utilities that have been changed to search for superblocks are
dumpfs(8), fsdb(8), ffsinfo(8), and fsck_ffs(8). Also, the prtblknos(8)
tool found in tools/diag/prtblknos searches for superblocks.

The UFS specific mount code uses the superblock search interface
when mounting the root filesystem and when the administrator doing
a mount(8) command specifies the force flag (-f). The standalone UFS
boot code (found in stand/libsa/ufs.c) uses the superblock search
code in the hope of being able to get the system up and running so
that fsck_ffs(8) can be used to get the filesystem cleaned up.

The following utilities have not been changed to search for
superblocks: clri(8), tunefs(8), snapinfo(8), fstyp(8), quot(8),
dump(8), fsirand(8), growfs(8), quotacheck(8), gjournal(8), and
glabel(8). When these utilities fail, they do report the cause of
the failure. The one exception is the tasting code used to try and
figure what a given disk contains. The tasting code will remain
silent so as not to put out a slew of messages as it trying to taste
every new mass storage device that shows up.

Reviewed by: kib
Reviewed by: Warner Losh
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D36053
Sponsored by: The FreeBSD Foundation


# d22531d5 31-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Identify each UFS/FFS superblock integrity check as a warning or fatal error.

Identify each of the superblock validation checks as either a
warning or a fatal error. Any integrity check that can cause a
system hang or crash is marked as fatal. Those that may simply
lead to poor file layoutor other less good operating conditions
are marked as warning.

Normally both fatal and warning are treated as errors and prevent
the superblock from being loaded. A new flag, UFS_NOWARNFAIL, is
added. When passed to ffs_sbget() it will note warnings that it
finds, but will still proceed with loading the superblock. Note
that when UFS_NOWARNFAIL is used, it also includes UFS_NOHASHFAIL.

No legitimate superblocks should fail as a result of these changes.


# b21582ee 30-Jul-2022 Kirk McKusick <mckusick@FreeBSD.org>

Add a flags parameter to the ffs_sbget() function that reads UFS superblocks.

Rather than trying to shoehorn flags into the requested superblock
address, create a separate flags parameter to the ffs_sbget()
function in sys/ufs/ffs/ffs_subr.c. The ffs_sbget() function is
used both in the kernel and in user-level utilities through export
to the sbget() function in the libufs(3) library (see sbget(3)
for details). The kernel uses ffs_sbget() when mounting UFS
filesystems, in the glabel(8) and gjournal(8) GEOM utilities,
and in the standalone library used when booting the system
from a UFS root filesystem.

The ffs_sbget() function reads the superblock located at the byte
offset specified by its sblockloc parameter. The value UFS_STDSB
may be specified for sblockloc to request that the standard
location for the superblock be read.

The two existing options are now flags:

UFS_NOHASHFAIL will note if the check hash is wrong but will still
return the superblock. This is used by the bootstrap code to
give the system a chance to come up so that fsck can be run to
correct the problem.

UFS_NOMSG indicates that superblock inconsistency error messages
should not be printed. It is used by programs like fsck that
want to print their own error message and programs like glabel(8)
that just want to know if a UFS filesystem exists on a partition.

One additional flag is added:

UFS_NOCSUM causes only the superblock itself to be returned, but does
not read in any auxiliary data structures like the cylinder group
summary information. It is used by clients like glabel(8) that
just want to check for possible filesystem types. Using UFS_NOCSUM
skips the superblock checks for csum data which allows superblocks
that have corrupted csum data to be read and used.

The validate_sblock() function checks that the superblock has not
been corrupted in a way that can crash or hang the system. Unless
the UFS_NOMSG flag is specified, it will print out any errors that
it finds. Prior to this commit, validate_sblock() returned as soon
as it found an inconsistency so would print at most one message.
It now does all its checks so when UFS_NOMSG has not been specified
will print out everything that it finds inconsistent.

Sponsored by: The FreeBSD Foundation


# f2b39152 15-Nov-2021 Kirk McKusick <mckusick@FreeBSD.org>

Add ability to suppress UFS/FFS superblock check-hash failure messages.

When reading UFS/FFS superblocks that have check hashes, both the kernel
and libufs print an error message if the check hash is incorrect. This
commit adds the ability to request that the error message not be made.
It is intended for use by programs like fsck that wants to print its
own error message and by kernel subsystems like glabel that just wants
to check for possible filesystem types.

This capability will be used in followup commits.

Sponsored by: Netflix


# b366ee48 14-Nov-2021 Kirk McKusick <mckusick@FreeBSD.org>

Consolodate four copies of the STDSB define into a single place.

The STDSB macro is passed to the ffs_sbget() routine to fetch a
UFS/FFS superblock "from the stadard place". It was identically defined
in lib/libufs/libufs.h, stand/libsa/ufs.c, sys/ufs/ffs/ffs_extern.h,
and sys/ufs/ffs/ffs_subr.c. Delete it from these four files and
define it instead in sys/ufs/ffs/fs.h. All existing uses of this macro
already include sys/ufs/ffs/fs.h so no include changes need to be made.

No functional change intended.

Sponsored by: Netflix


# 996d40f9 24-Oct-2020 Kirk McKusick <mckusick@FreeBSD.org>

Various new check-hash checks have been added to the UFS filesystem
over various major releases. Superblock check hashes were added for
the 12 release and cylinder-group and inode check hashes will appear
in the 13 release.

When a disk with a UFS filesystem is writably mounted, the kernel
clears the feature flags for anything that it does not support. For
example, if a UFS disk from a 12-stable kernel is mounted on an
11-stable system, the 11-stable kernel will clear the flag in the
filesystem superblock that indicates that superblock check-hashs
are being maintained. Thus if the disk is later moved back to a
12-stable system, the 12-stable system will know to ignore its
incorrect check-hash.

If the only filesystem modification done on the earlier kernel is
to run a utility such as growfs(8) that modifies the superblock but
neither updates the check-hash nor clears the feature flag indicating
that it does not support the check-hash, the disk will fail to mount
if it is moved back to its original newer kernel.

This patch moves the code that clears the filesystem feature flags
from the mount code (ffs_mountfs()) to the code that reads the
superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
code and is imported into libufs(3), all the filesystem utilities
will now also clear these flags when they make modifications to the
filesystem.

As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
and repair bad superblock check-hashes rather than refusing to run.
This change allows fsck to recover filesystems that have been impacted
by utilities older than those created after this change and is a
sensible thing to do in any event.

Reported by: John Baldwin (jhb@)
MFC after: 2 weeks
Sponsored by: Netflix


# 34816cb9 18-Jun-2020 Kirk McKusick <mckusick@FreeBSD.org>

Move the pointers stored in the superblock into a separate
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.

This cleanup of the superblock eases the task of the commit
that immediately follows this one.

Suggested by: brooks
Reviewed by: kib
PR: 246983
Sponsored by: Netflix


# f2620e9c 21-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Retire two unused background fsck sysctls.

These two sysctls were added to support UFS softupdates journalling
with snapshots. However, the changes to fsck to use them were never
committed and there have never been any in-tree uses of these sysctls.

More details from Kirk:

When journalling got added to soft updates, its journal rollback freed
blocks that it thought were no longer in use. But it does not take
snapshots into account (i.e., if a snapshot is still using it, then it
cannot be freed). So I added the needed logic to fsck by having the
free go through the kernel's blkfree code so it could grab blocks that
were still needed by snapshots. That is done using the setbufoutput
hack. I never got that code working reliably, so it is still sitting
in my work directory. Which also explains why you still cannot take
snapshots on filesystems running with journalling...

In looking over my use of this feature, and in particular the troubles
I was having with it, I conclude that it may be better to extract the
code from the kernel that handles freeing blocks claimed by snapshots
and putting it into fsck directly. My original intent was that it is
complex and at the time changing, so only having to maintain it in one
place was appealing. But at this point it has not changed in years and
the hacks like setinode and setbufoutput to be able to use the kernel
code is sufficiently ugly, that I am leaning towards just extracting
it.

Reviewed by: mckusick
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24484


# 5a0d467f 13-Aug-2019 Kirk McKusick <mckusick@FreeBSD.org>

Clarify comment that describes how the FS_METACKHASH is managed.

MFC after: 3 days


# ac4b20a0 25-Feb-2019 Kirk McKusick <mckusick@FreeBSD.org>

After a crash, a file that extends into indirect blocks may end up
shorter than its size resulting in a hole as its final block (which
is a violation of the invarients of the UFS filesystem).

Soft updates will always ensure that the file size is correct when
writing inodes to disk for files that contain only direct block
pointers. However soft updates does not roll back sizes for files
with indirect blocks that it has set to unallocated because their
contents have not yet been written to disk. Hence, the file can
appear to have a hole at its end because the block pointer has been
rolled back to zero when its inode was written to disk. Thus,
fsck_ffs calculates the last allocated block in the file. For files
that extend into indirect blocks, fsck_ffs checks for a size past
the last allocated block of the file and if that is found, shortens
the file to reference the last allocated block thus avoiding having
it reference a hole at its end.

Submitted by: Chuck Silvers <chs@netflix.com>
Tested by: Chuck Silvers <chs@netflix.com>
MFC after: 1 week
Sponsored by: Netflix


# ec888383 23-Oct-2018 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to the superblock. If a check hash fails when an attempt
is made to mount a filesystem, the mount fails with EINVAL (Invalid
argument). This avoids a class of filesystem panics related to
corrupted superblocks. The hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 068beacf 08-Feb-2018 Kirk McKusick <mckusick@FreeBSD.org>

The goal of this change is to prevent accidental foot shooting by
folks running filesystems created on check-hash enabled kernels
(which I will call "new") on a non-check-hash enabled kernels (which
I will call "old). The idea here is to detect when a filesystem is
run on an old kernel and flag the filesystem so that when it gets
moved back to a new kernel, it will not start getting a slew of
check-hash errors.

Back when the UFS version 2 filesystem was created, it added a file
flag FS_INDEXDIRS that was to be set on any filesystem that kept
some sort of on-disk indexing for directories. The idea was precisely
to solve the issue we have today. Specifically that a newer kernel
that supported indexing would be able to tell that the filesystem
had been run on an older non-indexing kernel and that the indexes
should not be used until they had been rebuilt. Since we have never
implemented on-disk directory indicies, the FS_INDEXDIRS flag is
cleared every time any UFS version 2 filesystem ever created is
mounted for writing.

This commit repurposes the FS_INDEXDIRS flag as the FS_METACKHASH
flag. Thus, the FS_METACKHASH is definitively known to have always
been cleared. The FS_INDEXDIRS flag has been moved to a new block
of flags that will always be cleared starting with this commit
(until they get used to implement some future feature which needs
to detect that the filesystem was mounted on a kernel that predates
the new feature).

If a filesystem with check-hashes enabled is mounted on an old
kernel the FS_METACKHASH flag is cleared. When that filesystem is
mounted on a new kernel it will see that the FS_METACKHASH has been
cleared and clears all of the fs_metackhash flags. To get them
re-enabled the user must run fsck (in interactive mode without the
-y flag) which will ask for each supported check hash whether it
should be rebuilt and enabled. When fsck is run in its default preen
mode, it will just ignore the check hashes so they will remain
disabled.

The kernel has always disabled any check hash functions that it
does not support, so as more types of check hashes are added, we
will get a non-surprising result. Specifically if filesystems get
moved to kernels supporting fewer of the check hashes, those that
are not supported will be disabled. If the filesystem is moved back
to a kernel with more of the check-hashes available and fsck is run
interactively to rebuild them, then their checking will resume.
Otherwise just the smaller subset will be checked.

A side effect of this commit is that filesystems running with
cylinder-group check hashes will stop having them checked until
fsck is run to re-enable them (since none of them currently have
the FS_METACKHASH flag set). So, if you want check hashes enabled
on your filesystems after booting a kernel with these changes, you
need to run fsck to enable them. Any newly created filesystems will
have check hashes enabled. If in doubt as to whether you have check
hashes emabled, run dumpfs and look at the list of enabled flags
at the end of the superblock details.


# dffce215 25-Jan-2018 Kirk McKusick <mckusick@FreeBSD.org>

Refactoring of reading and writing of the UFS/FFS superblock.
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.

No functional change intended.

Reviewed by: kib


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 75e3597a 21-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.

sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)

sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)

sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)


# 855662c6 04-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

The new fsck recovery information to enable it to find backup
superblocks created in revision 322297 only works on disks
with sector sizes up to 4K. This update allows the recovery
information to be created by newfs and used by fsck on disks
with sector sizes up to 64K. Note that FFS currently limits
filesystem to be mounted from disks with up to 8K sectors.
Expanding this limitation will be the subject of another
commit.

Reported by: Peter Holm
Reviewed with: kib


# 77b63aa0 08-Aug-2017 Kirk McKusick <mckusick@FreeBSD.org>

Since the switch to GPT disk labels, fsck for UFS/FFS has been
unable to automatically find alternate superblocks. This checkin
places the information needed to find alternate superblocks to the
end of the area reserved for the boot block.

Filesystems created with a newfs of this vintage or later will
create the recovery information. If you have a filesystem created
prior to this change and wish to have a recovery block created for
your filesystem, you can do so by running fsck in forground mode
(i.e., do not use the -p or -y options). As it starts, fsck will
ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should
answer yes.

Discussed with: kib, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11589


# 3cf259c3 05-May-2017 Ed Maste <emaste@FreeBSD.org>

UFS fs.h: clear warning from use in makefs(1)

makefs(1) has a number of signedness warnings (when built with higher
WARNS), most of which can be addressed by careful application of casts
in makefs itself.

There is one case where a signedness warning arises from the blksize
macro, so must be addressed in the macro itself.

Reviewed by: kib, mckusick
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10589


# 1dc349ab 15-Feb-2017 Ed Maste <emaste@FreeBSD.org>

prefix UFS symbols with UFS_ to reduce namespace pollution

Specifically:
ROOTINO -> UFS_ROOTINO
WINO -> UFS_WINO
NXADDR -> UFS_NXADDR
NDADDR -> UFS_NDADDR
NIADDR -> UFS_NIADDR
MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by: kib, mckusick
Obtained from: NetBSD
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9536


# 949b56ba 06-Sep-2016 Warner Losh <imp@FreeBSD.org>

Renumber the advertising clause.


# 08e94183 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

UFS: spelling fixes on comments.

No functional change.


# 4896af9f 01-Mar-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after: 3 days
Reviewed by: mckusick


# baa12a84 22-Mar-2013 Kirk McKusick <mckusick@FreeBSD.org>

The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.

The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.

The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).

This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker

Details of this implementation appears in the April 2013 of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


# 213a71c6 18-Nov-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix build of kdump(1).


# 1848286a 18-Nov-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add UFS writesuspension mechanism, designed to allow userland processes
to modify on-disk metadata for filesystems mounted for write.

Reviewed by: kib, mckusick
Sponsored by: FreeBSD Foundation


# 549f62fa 30-Oct-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix problem with geom_label(4) not recognizing UFS labels on filesystems
extended using growfs(8). The problem here is that geom_label checks if
the filesystem size recorded in UFS superblock is equal to the provider
(i.e. device) size. This check cannot be removed due to backward
compatibility. On the other hand, in most cases growfs(8) cannot set
fs_size in the superblock to match the provider size, because, differently
from newfs(8), it cannot recompute cylinder group sizes.

To fix this problem, add another superblock field, fs_providersize, used
only for this purpose. The geom_label(4) will attach if either fs_size
(filesystem created with newfs(8)) or fs_providersize (filesystem expanded
using growfs(8)) matches the device size.

PR: kern/165962
Reviewed by: mckusick
Sponsored by: FreeBSD Foundation


# 58b1333a 09-Nov-2011 Gleb Kurtsou <gleb@FreeBSD.org>

Use implementation independent inoNN_t scalars for on-disk UFS structures

Approved by: mdf (mentor)


# 927a12ae 15-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.


# 280e091a 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 455a6e0f 11-Feb-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. Journal need
to atomically write the block, that can only be guaranteed at the device
sector size, not larger. Attempt to write less then sector size results in
driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with: pho
Reviewed by: jeff
Tested by: bz, pho


# 8c2a54de 28-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Add kernel side support for BIO_DELETE/TRIM on UFS.

The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.

Based on the patch by: mckusick
Reviewed by: mckusick, pjd
Tested by: pho
MFC after: 1 month


# fae5c47d 11-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by: jeff
Tested by: pho


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# c0b2efce 14-Sep-2010 Kirk McKusick <mckusick@FreeBSD.org>

Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by: jeff Roberson


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 0718d64d 18-Apr-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r200796:

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


# 4179ce18 26-Feb-2010 Kirk McKusick <mckusick@FreeBSD.org>

MFC of 203763, 203764, 203768, 203769, 203770, 203782, and 203784.

These fixes correct a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

These fixes also update newfs to ensure that it will never create a
filesystem with more than 2^32 inodes.

They also update libufs, tunefs, and growfs so that they properly
handle inode numbers as unsigned.

Reported by: Scott Burns, John Kilburg, and Bruce Evans
Followup by: Jeff Roberson
PR: 133980


# 81479e68 11-Feb-2010 Kirk McKusick <mckusick@FreeBSD.org>

One last pass to get all the unsigned comparisons correct.


# e870d1e6 10-Feb-2010 Kirk McKusick <mckusick@FreeBSD.org>

This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR: 133980
MFC after: 2 weeks


# e268f54c 11-Jan-2010 Kirk McKusick <mckusick@FreeBSD.org>

Background:

When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by: jeff
Security issues: rwatson


# 9340fc72 21-Dec-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


# a6b691a8 17-Dec-2008 Sam Leffler <sam@FreeBSD.org>

Apply the big hammer:
o remove all of compat except for pwcache and strstuftoll; these might
end up in libutil or similar so keep them in the subdir
o mv getid.c up to the top level; this looks like something that'll be
makefs-specific
o eliminate private versions of .h files in sys; use system files instead
o eliminate private ffs_tables.c; use the system version directly (might
want to adopt const'ification at some point but that's the only diff I
can see)
o mv remaining code from sys to ffs and strip out unused bits; this now
becomes part of makefs
o add compat defs and shims to makefs.h
o strip all vestiges of nbtool_config.h, compat_defs.h, etc.
o fixup includes after file shuffling
o rename system #defines that do implicit byte swapping to have an _swap
suffix; e.g. DIRSIZ -> DIRSIZ_SWAP, cg_inosused -> cg_inosused_swap; if
we ever add endian-agnostic support to the kernel these can go back to
their original names
o strip some netbsd'isms that aren't worth shim'ing (e.g. _DIAGASSERT)

Code compiles w/o complaints but is untested.


# b6842439 23-Nov-2008 Sam Leffler <sam@FreeBSD.org>

prepare makefs for import to base


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8e7a2353 24-May-2008 Craig Rodrigues <rodrigc@FreeBSD.org>

Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE
was renamed to SBLOCKSIZE in version 1.33

Reviewed by: mckusick


# 1a60c7fc 31-Oct-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# a16baf37 20-Feb-2005 Xin LI <delphij@FreeBSD.org>

The recomputation of file system summary at mount time can be a
very slow process, especially for large file systems that is just
recovered from a crash.

Since the summary is already re-sync'ed every 30 second, we will
not lag behind too much after a crash. With this consideration
in mind, it is more reasonable to transfer the responsibility to
background fsck, to reduce the delay after a crash.

Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to
control this behavior. When set to nonzero, we will get the
"old" behavior, that the summary is computed immediately at mount
time.

Add five new sysctl variables to adjust ndir, nbfree, nifree,
nffree and numclusters respectively. Teach fsck_ffs about these
API, however, intentionally not to check the existence, since
kernels without these sysctls must have recomputed the summary
and hence no adjustments are necessary.

This change has eliminated the usual tens of minutes of delay of
mounting large dirty volumes.

Reviewed by: mckusick
MFC After: 1 week


# f2aa1113 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Mark the struct fs members that require the ufsmount mutex.
- Define some macros for manipulating the fs_active bitmap.

Sponsored By: Isilon Systems, Inc.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 894d8d3c 09-Oct-2004 Nate Lawson <njl@FreeBSD.org>

Fix fsbtodb() for UFS1. This fixes an overflow for file sizes >1 TB,
allowing for sizes up to 4 TB. This doesn't affect UFS2 since b is already
a 64 bit type, coincidental with daddr_t.

Submitted by: bde


# b72ea57f 19-Aug-2004 John Baldwin <jhb@FreeBSD.org>

Generalize the UFS bad magic value used to determine when a filesystem
has only been partly initialized via newfs(8) so that it applies to both
UFS1 and UFS2.

Submitted by: "Xin LI" delphij at frontfree dot net
MFC: maybe?


# b4a1d929 31-May-2004 Kirill Ponomarev <krion@FreeBSD.org>

- Fix typo

Approved by: tobez


# 012d4134 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


# b1fddb23 03-Apr-2004 Maxime Henrion <mux@FreeBSD.org>

Fix the remaining warnings of growfs(8) on my sparc64 box with
WARNS=6. I don't change the WARNS level in the Makefile because I
didn't tested this on other archs.

The fs.h fix was suggested by: marcel
Reviewed by: md5(1)


# ec52df8e 16-Nov-2003 Wes Peters <wes@FreeBSD.org>

Write the UFS2 superblock with a 'BAD' magic number at the beginning
of newfs, to signify the newfs operation has not yet completed. Re-
write the superblock with the correct magic number once all of the
cylinder groups have been created to show the operation has finished.

Sponsored by: St. Bernard Software


# d60682c2 21-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a bug in the logical block calculation macros so
that they convert to 64-bit values before shifting rather than
afterwards. Once fixed, they can be used rather than inline expanded.

Sponsored by: DARPA & NAI Labs.


# cc4a8583 09-Jan-2003 Marcel Moolenaar <marcel@FreeBSD.org>

o Improve wording of the comment that accompanies fs_pad. The
padding is not specific to non-i386 architectures. It is
caused by non-i386 specific alignment requirements of
fs_swuid,
o Add a CTASSERT to catch a change in the size of struct fs
at compile-time rather than run-time.

Ok'd: gordon
Tested on: i386 ia64


# 963cae78 09-Jan-2003 Gordon Tetlow <gordon@FreeBSD.org>

Fix superblock alignment problems on non-i386 platforms. Also change fs_uuid
to fs_swuid, making it more descriptive.

Submitted by: marcel
Reviewed by: peter
Pointy hat to: gordon


# 291871da 08-Jan-2003 Gordon Tetlow <gordon@FreeBSD.org>

Steal some space from fs_fsmnt to create fs_volname and fs_uuid. The volname
will be used to support volume names with the help of a GEOM module (to be
committed). uuid will be used to deal with conflicting volume names (which
doesn't work just yet).

Approved by: mckusick@


# ada981b2 26-Nov-2002 Kirk McKusick <mckusick@FreeBSD.org>

Create a new 32-bit fs_flags word in the superblock. Add code to move
the old 8-bit fs_old_flags to the new location the first time that the
filesystem is mounted by a new kernel. One of the unused flags in
fs_old_flags is used to indicate that the flags have been moved.
Leave the fs_old_flags word intact so that it will work properly if
used on an old kernel.

Change the fs_sblockloc superblock location field to be in units
of bytes instead of in units of filesystem fragments. The old units
did not work properly when the fragment size exceeeded the superblock
size (8192). Update old fs_sblockloc values at the same time that
the flags are moved.

Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk>
Sponsored by: DARPA & NAI Labs.


# 3ceef565 14-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Define two new superblock file system flags:

FS_ACLS Administrative enable/disable of extended ACL support
FS_MULTILABEL Administrative flag to indicate to the MAC Framework
that objects in the file system are individually
labeled using extended attributes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Reviewed by: (in principal) mckusick, phk


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 7110af75 12-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

ARGH! SBLOCK is not unused. Try to get this right.

BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).

Define SBLOCK again, using the right math.

Sponsored by: DARPA & NAI Labs.


# 7cb71b74 12-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove #define for BBOFF, it is assumed == 0 so many places that we might
as well forget about it. In fact the only thing which used it was the
SBOFF macro.

Sponsored by: DARPA & NAI Labs.


# 16910634 12-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused BBLOCK and SBLOCK #defines.

Sponsored by: DARPA & NAI Labs.


# a463023d 03-Apr-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h>

Sponsored by: DARPA & NAI Labs.


# 99bef878 17-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.


# f305c5d1 18-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by: Jake Burkholder <jake@locore.ca>


# 3fa4044e 16-Dec-2001 Ian Dowse <iedowse@FreeBSD.org>

Move the new superblock field `fs_active' into the region of the
superblock that is already set up to handle pointer types. This
fixes an accidental change in the superblock size on 64-bit platforms
caused by revision 1.24.


# cc5a9233 13-Dec-2001 Kirk McKusick <mckusick@FreeBSD.org>

Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.


# 815d14dd 15-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.


# 9ccb939e 08-May-2001 Kirk McKusick <mckusick@FreeBSD.org>

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.


# 1a6a6610 13-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */


# a61ab64a 10-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
test is at wd1. Size of test file system is 8 Gb, number of cg=991,
size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

Test Results

tar -xzf ports.tar.gz rm -rf ports
mode old dirpref new dirpref speedup old dirprefnew dirpref speedup
First system
normal 667 472 1.41 477 331 1.44
async 285 144 1.98 130 14 9.29
sync 768 616 1.25 477 334 1.43
softdep 413 252 1.64 241 38 6.34
Second system
normal 329 81 4.06 263.5 93.5 2.81
async 302 25.7 11.75 112 2.26 49.56
sync 281 57.0 4.93 263 90.5 2.9
softdep 341 40.6 8.4 284 4.76 59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
* Find a cylinder to place a directory.
*
* The policy implemented by this algorithm is to select from
* among those cylinder groups with above the average number of
* free inodes, the one with the smallest number of directories.
*/

A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.

What I mean by a big file system ?

1. A big filesystem is a filesystem which occupy 20-30 or more percent
of total drive space, i.e. first and last cylinder are physically
located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example
more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
* Find a cylinder group to place a directory.
*
* The policy implemented by this algorithm is to allocate a
* directory inode in the same cylinder group as its parent
* directory, but also to reserve space for its files inodes
* and data. Restrict the number of directories which may be
* allocated one after another in the same cylinder group
* without intervening allocation of files.
*
* If we allocate a first level directory then force allocation
* in another cylinder group.
*/

My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.

The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */

These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.

Obtained from: Grigoriy Orlov <gluk@ptci.ru>


# 812b1d41 20-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Add kernel support for running fsck on active filesystems.


# 589c7af9 07-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).


# f55ff3f3 15-Jan-2001 Ian Dowse <iedowse@FreeBSD.org>

The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by: mckusick


# 22e5a623 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Get userland visible flags added for snapshots to give a few days
advance preparation for them to get migrated into place so that
subsequent changes in utilities will not fail to compile for lack
of up-to-date header files in /usr/include.


# 584508a7 16-Mar-2000 Kirk McKusick <mckusick@FreeBSD.org>

Use 64-bit math to calculate if we have hit our freespace limit.
Necessary for coherent results on filesystems bigger than 0.5Tb.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 90b42d7e 23-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Fixed corrupted newline and corrupted tab in previous commit.


# 8f89943e 23-Mar-1997 Guido van Rooij <guido@FreeBSD.org>

Add generation number randomization. Newly created filesystems wil now
automatically have random generation numbers. The kenel way of handling those
also changed. Further it is advised to run fsirand on all your nfs exported
filesystems. the code is mostly copied from OpenBSD, with the randomization
chanegd to use /dev/urandom
Reviewed by: Garrett
Obtained from: OpenBSD


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# ccb17497 12-Oct-1996 Bruce Evans <bde@FreeBSD.org>

Fixed lblktosize(). It overflowed at 2G. This bug only affected
ufs_read() and ufs_write().

Found by: looking at warnings for comparing the result of lblktosize()
(which is usually daddr_t = long) with file sizes (which are u_quad_t
for ufs). File sizes should probably be off_t's to avoid warnings
when the are compared with file offsets, so the fixed lblktosize()
casts to off_t instead of u_quad_t.

Added definition of smalllblksize(). It is the same as the old
lblksize() and is more efficient for small block numbers on 32-bit
machines.

Use smalllblktosize() instead of its expansion in blksize() and
dblksize(). This keeps the line length short and makes it more
obvious that the shift can't overflow.


# e1eec28a 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 07be9bd4 10-Mar-1995 David Greenman <dg@FreeBSD.org>

Increased default minfree to 8%.


# eb9fb78c 21-Aug-1994 Paul Richards <paul@FreeBSD.org>

Made idempotent
Reviewed by:
Submitted by:


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources