Cross Reference: /freebsd-current/sys/ufs/ffs/ffs

History log of /freebsd-current/sys/ufs/ffs/ffs_alloc.c
Revision	Date	Author	Comments
# 35a30155	03-Dec-2023	Kirk McKusick <mckusick@FreeBSD.org>	Increase UFS/FFS maximum link count from 32767 to 65530. The link count for a UFS/FFS inode is stored in a signed 16-bit integer. Thus the maximum link count has been 32767. This limit has been recently hit by the poudriere build system when doing a ports build as it needs one directory per port and the number of ports recently passed 32767. A long-term solution would be to use one of the spare 32-bit fields in the inode to store the link count. However, the UFS1 format does not have a spare and adding the spare in UFS2 would make it hard to make it compatible when running on older kernels that use the original link count field. So this patch uses the much simpler approach of changing the existing link count field from a signed 16-bit value to an unsigned 16-bit value. It has the fewest lines of code changes. The only thing that changes is the type in the dinode and inode structures and the definition of UFS_LINK_MAX. It has the added benefit that it works with both UFS1 and UFS2. It allows easy backward compatibility. Indeed it is backward compatibility that is the primary reason to go with this approach. If a filesystem with the new organization is mounted on an older kernel, it still needs to work. Thus if we move the new link count to a new field, we still need to maintain the old link count as best as possible even when running on a kernel that knows about the larger link counts. And we would have to carry this overhead for the indefinite future. If we have a new link-count field, we will have to add a new filesystem flag to indicate that we are running with larger link counts. We will also need to add of one of the new-feature flags to say that we have larger link counts. Older kernels clear the new-feature flags that they do not know about, so when a filesystem is used on an older kernel and then moved back to a newer one, the newer one will know that the new link counts have not been maintained and that it will be necessary to run a full fsck on the filesystem to correct the link counts before it can be mounted. With this change, older kernels will generally work with the bigger counts. While it will not itself allow the link count to exceed 32767, it will have no problem working with inodes that have a link count greater than 32767. Since it tests that i_nlink <= UFS_LINK_MAX, counts that are bigger than 32767 will appear negative, so will still pass the test. Of course, if they ever drop below 32767, they will no longer be able to exceed 32767. The one issue is if the link count ever exceeds 65535 then it will wrap to zero and the older kernel will be none the wiser. But this corner case is likely to be very rare since these kernels and the applications running on them do not expect to be able to get link counts over 32767. And over time, the use of new filesystems on older kernels will become rarer and rarer. Reported-by: Mark Millard running poudriere on the ports tree Reviewed-by: kib, olce.freebsd_certner.fr Tested-by: Peter Holm, Mark Millard MFC-after: 2 weeks Differential Revision: https://reviews.freebsd.org/D42767
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 220427da	12-Aug-2023	Kirk McKusick <mckusick@FreeBSD.org>	Set UFS/FFS file type to snapshot before changing its block pointers. A UFS/FFS snapshot file is identified with the SF_SNAPSHOT flag to identify it as a snapshot. This flag needs to be set before setting some of its block pointers to the special values BLK_SNAP and BLK_NOCOPY. If the snapshot creation fails and we call VOP_REMOVE(), the SF_SNAPSHOT flag will let the remove routine know that the special block pointer values need to be rolled back before attempting deletion of the file. Also ensure that an fsck is required after setting superblock values in the ffs_checkcgintegrity() routine. Reported-by: Peter Holm Tested-by: Peter Holm MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# c3046779	11-Aug-2023	Kirk McKusick <mckusick@FreeBSD.org>	Optimize operations on UFS/FFS filesystems with bad cylinder group(s). If a UFS/FFS filesystem develops a broken cylinder group (which is usually detected when its check hash fails), that cylinder group will not be usable until the filesystem has been unmounted and fsck has been run to repair it. On the first attempt to to allocate resources from the broken cylinder group, its available resources are set to zero in the superblock summary information. Since it will appear to have no resources available, no further calls will be made to allocate resources from it. When resources are freed to the broken cylinder group, the resource free routines will find the cylinder group unusable so the resource will simply be discarded and thus will not show up in the superblock summary information until they are recovered by fsck. Reported-by: Peter Holm Tested-by: Peter Holm MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# 67702352	10-Aug-2023	Kirk McKusick <mckusick@FreeBSD.org>	Cleanups to UFS/FFS ffs_checkblk(). Rename to ffs_checkfreeblk() to better describe that it is checking to find out if a block or fragment is free. Clarify its implementation. No functional change intended. MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# 6dff61a1	08-Aug-2023	Kirk McKusick <mckusick@FreeBSD.org>	Rate limit kernel UFS/FFS cylinder group check-hash error messages. When a large file is deleted or a large number of files are deleted, even a single cylinder group with a bad check hash can generate thousands of check-hash warnings. As with other filesystem messages such as out-of-space, print a maximum of one check-hash error per second. Note the limit is per filesystem. If two filesystems have cylinder group(s) with a bad check hash, each will print a maximum of one check-hash error message per second. MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# d4a8f5bf	07-Aug-2023	Kirk McKusick <mckusick@FreeBSD.org>	Handle UFS/FFS file deletion from cylinder groups with check-hash failure. When a file is deleted, its blocks need to be put back in the free block list and its inode needs to be put back in the inode free list. These lists reside in cylinder-group maps. If either some of its blocks or its inode reside in a cylinder-group map with a bad check hash it is not possible to free the associated resource. Since the cylinder group cannot be repaired until the filesystem is unmounted these resources cannot be freed. They simply accumulate in memory. And any attempt to unmount the filesystem loops forever trying to flush them. With this change, the resource update claims to succeed so that the file deletion can successfully complete. The filesystem is marked as requiring an fsck so that before the next time that the filesystem is mounted, the offending cylinder groups are reconstructed causing the lost resources to be reclaimed. A better solution would be to downgrade the filesystem to read-only, but that capability is not currently implemented. Reported-by: Peter Holm Tested-by: Peter Holm MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# 831b1ff7	27-Jul-2023	Kirk McKusick <mckusick@FreeBSD.org>	UFS/FFS: Migrate to modern uintXX_t from u_intXX_t. As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html move to the modern uintXX_t. While here also migrate u_char to uint8_t. Where other kernel interfaces allow, migrate u_long to uint64_t. No functional changes intended. MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# 4a344442	25-Jul-2023	Kirk McKusick <mckusick@FreeBSD.org>	Comment cleanup. MFC-after: 1 week Sponsored-by: The FreeBSD Foundation
# ba8cc6d7	12-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use __enum_uint8 for vtype and vstate This whacks hackery around only reading v_type once. Bump __FreeBSD_version to 1400093
# e4a905d1	25-May-2023	Kirk McKusick <mckusick@FreeBSD.org>	Add the ability to adjust directory depths to background fsck_ffs(8). Commit fe5e6e2 improved FFS directory placement when creating new directories. It is done by keeping track of the depth of directories in the filesystem and placing those lower in the tree closer together while spreading out those higher in the tree. Fsck_ffs(8) checks these depths and if incorrect adjusts them to their correct value. When running in background fsck_ffs(8) needs to be able to make an adjustment to the depth. This commit adds the sysctl to make such an adjustment and adds the code to fsck_ffs(8) to use the new sysctl. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# fe5e6e2c	29-Mar-2023	Kirk McKusick <mckusick@FreeBSD.org>	Improvement in UFS/FFS directory placement when doing mkdir(2). The algorithm for laying out new directories was devised in the 1980s and markedly improved the performance of the filesystem. In those days large disks had at most 100 cylinder groups and often as few as 10-20. Modern multi-terrabyte disks have thousands of cylinder groups. The original algorithm does not handle these large sizes well. This change attempts to expand the scope of the original algorithm to work well with these much larger disks while still retaining the properties of the original algorithm for small disks. The filesystem implementation is divided into policy routines and implementation routines. The policy routines can be changed in any way desired without risk of corrupting the filesystem. The policy requests are handled by the implementation layer. If the policy asks for an available resource, it is granted. But if it asks for an already in-use resource, then the implementation will provide an available one nearby the request. Thus it is impossible for a policy to double allocate. This change is limited to the policy implementation. This change updates the ffs_dirpref() routine which is responsible for selecting the cylinder group into which a new directory should be placed. If we are near the root of the filesystem we aim to spread them out as much as possible. As we descend deeper from the root we cluster them closer together around their parent as we expect them to be more closely interactive. Higher-level directories like usr/src/sys and usr/src/bin should be separated while the directories in these areas are more likely to be accessed together so should be closer. And directories within commands or kernel subsystems should be closer still. We pick a range of cylinder groups around the cylinder group of the directory in which we are being created. The size of the range for our search is based on our depth from the root of our filesystem. We then probe that range based on how many directories are already present. The first new directory is at 1/2 (middle) of the range; the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16, 5/16, etc. It is desirable to store the depth of a directory in its on-disk inode so that it is available when we need it. We add a new field di_dirdepth to track the depth of each directory. Because there are few spare fields left in the inode, we choose to share an existing field in the inode rather than having one of our own. Specifically we create a union with the di_freelink field. The di_freelink field is used to track inodes that have been unlinked but remain referenced. It is not needed until a rmdir(2) operation has been done on a directory. At that point, the directory has no contents and even if it is kept active as a current directory is no longer able to have any new directories or files created in it. Thus the use of di_dirdepth and di_freelink will never coincide. Reported by: Timo Voelker Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D39246
# 6b9d4fbb	13-Aug-2022	Kirk McKusick <mckusick@FreeBSD.org>	Explicitly initialize rather than reading newly allocated UFS inodes. The function ffs_vgetf() is used to find or load UFS inodes into a vnode. It first looks up the inode and if found in the cache its vnode is returned. If it is not already in the cache, a new vnode is allocated and its associated inode read in from the disk. The read is done even for inodes that are being initially created. The contents for the inode on the disk are assumed to be empty. If the on-disk contents had been corrupted either due to a hardware glitch or an agent deliberately trying to exploit the system, the UFS code could panic from the unexpected partially-allocated inode. Rather then having fsck_ffs(8) verify that all unallocated inodes are properly empty, it is easier and quicker to add a flag to ffs_vgetf() to indicate that the request is for a newly allocated inode. When set, the disk read is skipped and the inode is set to its expected empty (zero'ed out) initial state. As a side benefit, an unneeded disk I/O is avoided. Reported by: Peter Holm Sponsored by: The FreeBSD Foundation
# 064e6b43	13-Jul-2022	Kirk McKusick <mckusick@FreeBSD.org>	Rewrite function definitions in the UFS/FFS code base with identifier lists. The K&R style in UFS and other places in the tree's days are numbered as this syntax is removed in C2x proposal N2432: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf Though running to nearly 6000 lines of diffs this update should cause no functional change to the code. Requested by: Warner Losh MFC after: 2 weeks
# 2733b242	27-Mar-2022	Gordon Bergling <gbe@FreeBSD.org>	ffs(3): Fix a common typo in source code comments - s/quadradically/quadratically/ Obtained from: NetBSD MFC after: 3 days
# 99aa3b73	27-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	ffs: lock buffers after snaplk with LK_NOWITNESS Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34073
# e11b2b69	27-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	ffs_alloc.c: order includes alphabetically Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34073
# f9af3151	05-Dec-2021	Gordon Bergling <gbe@FreeBSD.org>	Revert "ffs(3): Fix a typo in a sysctl description" It should be - s/contigous/contiguous/ not continuous Reported by: tuexen@ This reverts commit 42efe994ece337929b6f9937829f7970e620db9b.
# 42efe994	03-Dec-2021	Gordon Bergling <gbe@FreeBSD.org>	ffs(3): Fix a typo in a sysctl description - s/contigous/continuous/ MFC after: 3 days
# 76b05e3e	01-Nov-2021	Konstantin Belousov <kib@FreeBSD.org>	ffs: Remove assertions about locked um_devvp in several places Namely, ffs_blkfree_cg(), and ffs_flushfiles(). Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761
# 9acea164	02-Oct-2021	Robert Wing <rew@FreeBSD.org>	ffs: retire unused fsckpid mount option The fsckpid mount option was introduced in 927a12ae16433b50 along with a couple sysctl's to support SU+J with snapshots. However, those sysctl's were never used and eventually removed in f2620e9ceb3ede02. There are no in-tree consumers of this mount option. Reviewed by: mckusick, kib Differential Revision: https://reviews.freebsd.org/D32015
# d4d289cd	21-May-2021	Konstantin Belousov <kib@FreeBSD.org>	ffs: mark block (re-)allocations as seqc writes Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041
# b2f95756	31-May-2021	Mark Johnston <markj@FreeBSD.org>	ffs: Correct the input size check in sysctl_ffs_fsck() Make sure we return an error if no input was specified, since SYSCTL_IN() will report success in that case. Reported by: KMSAN Reviewed by: mckusick MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30586
# cc9958bf	19-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	ffs_reallocblks: change the guard for softdep_prealloc() call to DOINGSUJ() instead of DOINGSOFTDEP(). The softdep_prealloc() function does nothing in SU case. Note that the call should be safe with regard to the vnode relock, because it is called with MNT_NOWAIT, which does not descend into fsync. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation
# 6b3a9a0f	11-Jan-2021	Mateusz Guzik <mjg@FreeBSD.org>	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)
# 61846fc4	13-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	Add a framework that tracks exclusive vnode lock generation count for UFS. This count is memoized together with the lookup metadata in directory inode, and we assert that accesses to lookup metadata are done under the same lock generation as they were stored. Enabled under DIAGNOSTICS. UFS saves additional data for parent dirent when doing lookup (i_offset, i_count, i_endoff), and this data is used later by VOPs operating on dirents. If parent vnode exclusive lock is dropped and re-acquired between lookup and the VOP call, we corrupt directories. Framework asserts that corruption cannot occur that way, by tracking vnode lock generation counter. Updates to inode dirent members also save the counter, while users compare current and saved counters values. Also, fix a case in ufs_lookup_ino() where i_offset and i_count could be updated under shared lock. It is not a bug on its own since dvp i_offset results from such lookup cannot be used, but it causes false positive in the checker. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136
# 96474d2a	15-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Do not copy vp into f_data for DTYPE_VNODE files. The pointer to vnode is already stored into f_vnode, so f_data can be reused. Fix all found users of f_data for DTYPE_VNODE. Provide finit_vnode() helper to initialize file of DTYPE_VNODE type. Reviewed by: markj (previous version) Discussed with: freqlabs (openzfs chunk) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346
# d90f2c36	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	ufs: clean up empty lines in .c and .h files
# a92a971b	16-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)
# 52488b51	04-Jun-2020	Kirk McKusick <mckusick@FreeBSD.org>	Further evaluation of the POSIX spec for fdatasync() shows that it requires that new data on growing files be accessible. Thus, the the fsyncdata() system call must update the on-disk inode when the size of the file has changed. This commit adds another inode update flag, IN_SIZEMOD, that gets set any time that the file size changes. If either the IN_IBLKDATA or the IN_SIZEMOD flag is set when fdatasync() is called, the associated inode is synchronously written to disk. We could have overloaded the IN_IBLKDATA flag to also track size changes since the only (current) use case for these flags are for fsyncdata(), but it does seem useful for possible future uses to separately track the file size changes and the inode block pointer changes. Reviewed by: kib MFC with: -r361785 Differential revision: https://reviews.freebsd.org/D25072
# d79ff54b	25-May-2020	Chuck Silvers <chs@FreeBSD.org>	This commit enables a UFS filesystem to do a forcible unmount when the underlying media fails or becomes inaccessible. For example when a USB flash memory card hosting a UFS filesystem is unplugged. The strategy for handling disk I/O errors when soft updates are enabled is to stop writing to the disk of the affected file system but continue to accept I/O requests and report that all future writes by the file system to that disk actually succeed. Then initiate an asynchronous forced unmount of the affected file system. There are two cases for disk I/O errors: - ENXIO, which means that this disk is gone and the lower layers of the storage stack already guarantee that no future I/O to this disk will succeed. - EIO (or most other errors), which means that this particular I/O request has failed but subsequent I/O requests to this disk might still succeed. For ENXIO, we can just clear the error and continue, because we know that the file system cannot affect the on-disk state after we see this error. For EIO or other errors, we arrange for the geom_vfs layer to reject all future I/O requests with ENXIO just like is done when the geom_vfs is orphaned. In both cases, the file system code can just clear the error and proceed with the forcible unmount. This new treatment of I/O errors is needed for writes of any buffer that is involved in a dependency. Most dependencies are described by a structure attached to the buffer's b_dep field. But some are created and processed as a result of the completion of the dependencies attached to the buffer. Clearing of some dependencies require a read. For example if there is a dependency that requires an inode to be written, the disk block containing that inode must be read, the updated inode copied into place in that buffer, and the buffer then written back to disk. Often the needed buffer is already in memory and can be used. But if it needs to be read from the disk, the read will fail, so we fabricate a buffer full of zeroes and pretend that the read succeeded. This zero'ed buffer can be updated and written back to disk. The only case where a buffer full of zeros causes the code to do the wrong thing is when reading an inode buffer containing an inode that still has an inode dependency in memory that will reinitialize the effective link count (i_effnlink) based on the actual link count (i_nlink) that we read. To handle this case we now store the i_nlink value that we wrote in the inode dependency so that it can be restored into the zero'ed buffer thus keeping the tracking of the inode link count consistent. Because applications depend on knowing when an attempt to write their data to stable storage has failed, the fsync(2) and msync(2) system calls need to return errors if data fails to be written to stable storage. So these operations return ENXIO for every call made on files in a file system where we have otherwise been ignoring I/O errors. Coauthered by: mckusick Reviewed by: kib Tested by: Peter Holm Approved by: mckusick (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24088
# 71d11ee3	22-May-2020	John Baldwin <jhb@FreeBSD.org>	Update name of description of vfs.ffs.setsize in comment. Previously it used the name 'adjsize' instead of 'setsize'.
# f2620e9c	21-Apr-2020	John Baldwin <jhb@FreeBSD.org>	Retire two unused background fsck sysctls. These two sysctls were added to support UFS softupdates journalling with snapshots. However, the changes to fsck to use them were never committed and there have never been any in-tree uses of these sysctls. More details from Kirk: When journalling got added to soft updates, its journal rollback freed blocks that it thought were no longer in use. But it does not take snapshots into account (i.e., if a snapshot is still using it, then it cannot be freed). So I added the needed logic to fsck by having the free go through the kernel's blkfree code so it could grab blocks that were still needed by snapshots. That is done using the setbufoutput hack. I never got that code working reliably, so it is still sitting in my work directory. Which also explains why you still cannot take snapshots on filesystems running with journalling... In looking over my use of this feature, and in particular the troubles I was having with it, I conclude that it may be better to extract the code from the kernel that handles freeing blocks claimed by snapshots and putting it into fsck directly. My original intent was that it is complex and at the time changing, so only having to maintain it in one place was appealing. But at this point it has not changed in years and the hacks like setinode and setbufoutput to be able to use the kernel code is sufficiently ugly, that I am leaning towards just extracting it. Reviewed by: mckusick MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D24484
# d2222aa0e	07-Mar-2020	Mateusz Guzik <mjg@FreeBSD.org>	fd: use smr for managing struct pwd This has a side effect of eliminating filedesc slock/sunlock during path lookup, which in turn removes contention vs concurrent modifications to the fd table. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23889
# f15ccf88	06-Mar-2020	Chuck Silvers <chs@FreeBSD.org>	Add a new "mntfs" pseudo file system which provides private device vnodes for file systems to safely access their disk devices, and adapt FFS to use it. Also add a new BO_NOBUFS flag to allow enforcing that file systems using mntfs vnodes do not accidentally use the original devfs vnode to create buffers. Reviewed by: kib, mckusick Approved by: imp (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23787
# 8d03b99b	01-Mar-2020	Mateusz Guzik <mjg@FreeBSD.org>	fd: move vnodes out of filedesc into a dedicated structure The new structure is copy-on-write. With the assumption that path lookups are significantly more frequent than chdirs and chrooting this is a win. This provides stable root and jail root vnodes without the need to reference them on lookup, which in turn means less work on globally shared structures. Note this also happens to fix a bug where jail vnode was never referenced, meaning subsequent access on lookup could run into use-after-free. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23884
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# ac4ec141	12-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	ufs: add a setter for inode i_flag field This will be used later to add vnodes to the lazy list. Reviewed by: kib (previous version), jeff Tested by: pho (in a larger patch) Differential Revision: https://reviews.freebsd.org/D22994
# b249ce48	03-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427
# d00066a5	03-Dec-2019	Kirk McKusick <mckusick@FreeBSD.org>	Currently the breadn_flags() and getblkx() interfaces are passed the vnode, logical block number, and size of data block that is being requested. They then use the VOP_BMAP function to calculate the mapping from logical block number to physical block number from which to access the data. This change expands the interface to also pass the physical block number in cases where the VOP_MAP function may no longer work, for example when a file is being truncated. No functional change. Reviewed by: kib Tested by: Peter Holm Sponsored by: Netflix
# 2bcfb938	25-Nov-2019	Chuck Silvers <chs@FreeBSD.org>	In ffs_freefile(), use a separate variable to hold the inode number within the cg rather than reusuing "ino" for this purpose. This reduces the diff for an upcoming change that improves handling of I/O errors. No functional change. Reviewed by: mckusick Approved by: mckusick (mentor) Sponsored by: Netflix
# 44d37182	03-Oct-2019	Kirk McKusick <mckusick@FreeBSD.org>	Update ffs_getcg() function to accept a flags parameter to be passed to breadn_flags() in preparation for later need when doing forcible unmount when disk dies or is removed. No functional change. Sponsored by: Netflix
# f3cf6225	06-Sep-2019	Conrad Meyer <cem@FreeBSD.org>	ufs: Remove redundant brelse() after r294954 Same automation. No functional change.
# 16040222	29-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	UFS: stop reusing the vnode for reallocated inode. In ffs_valloc(), force reclaim existing vnode on inode reuse, instead of trying to re-initialize the same vnode for new purposes. This is done in preparation of changes to the vp->v_object lifecycle handling. A new FFSV_REPLACE flag to ffs_vgetf() directs the function to vgone(9) the vnode if found in vfs hash, instead of returning it. Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412
# fdf34aa3	17-Jul-2019	Kirk McKusick <mckusick@FreeBSD.org>	The error reported in FS-14-UFS-3 can only happen on UFS/FFS filesystems that have block pointers that are out-of-range for their filesystem. These out-of-range block pointers are corrected by fsck(8) so are only encountered when an unchecked filesystem is mounted. A new "untrusted" flag has been added to the generic mount interface that can be set when mounting media of unknown provenance or integrity. For example, a daemon that automounts a filesystem on a flash drive when it is plugged into a system. This commit adds a test to UFS/FFS that validates all block numbers before using them. Because checking for out-of-range blocks adds unnecessary overhead to normal operation, the tests are only done when the filesystem is mounted as an "untrusted" filesystem. Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE Reported as: FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg) Reviewed by: kib Sponsored by: Netflix
# 1fd136ec	16-Jul-2019	Kirk McKusick <mckusick@FreeBSD.org>	When a process attempts to allocate space on a full filesystem, a filesystem full message is sent to the offending process or the kernel log if the offending process cannot be identified. To prevent an explotion of messages, the kernel ppsratecheck() function is used to limit the messages to one per second. This revision changes the variable that tracks the rate of these messages from a systemwide limit to a per-filesystem limit by moving it from a global variable to a variable in the ufsmount structure. Suggested by: kib Reviewed by: kib Sponsored by: Netflix
# f89d2072	17-Jun-2019	Xin LI <delphij@FreeBSD.org>	Separate kernel crc32() implementation to its own header (gsb_crc32.h) and rename the source to gsb_crc32.c. This is a prerequisite of unifying kernel zlib instances. PR: 229763 Submitted by: Yoshihiro Ota <ota at j.email.ne.jp> Differential Revision: https://reviews.freebsd.org/D20193
# af6aeacb	28-May-2019	Kirk McKusick <mckusick@FreeBSD.org>	Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS as appropriate. No functional change intended. Suggested-by: markj
# a1304030	06-Apr-2019	Mariusz Zaborski <oshogbo@FreeBSD.org>	Introduce funlinkat syscall that always us to check if we are removing the file associated with the given file descriptor. Reviewed by: kib, asomers Reviewed by: cem, jilles, brooks (they reviewed previous version) Discussed with: pjd, and many others Differential Revision: https://reviews.freebsd.org/D14567
# ac4b20a0	25-Feb-2019	Kirk McKusick <mckusick@FreeBSD.org>	After a crash, a file that extends into indirect blocks may end up shorter than its size resulting in a hole as its final block (which is a violation of the invarients of the UFS filesystem). Soft updates will always ensure that the file size is correct when writing inodes to disk for files that contain only direct block pointers. However soft updates does not roll back sizes for files with indirect blocks that it has set to unallocated because their contents have not yet been written to disk. Hence, the file can appear to have a hole at its end because the block pointer has been rolled back to zero when its inode was written to disk. Thus, fsck_ffs calculates the last allocated block in the file. For files that extend into indirect blocks, fsck_ffs checks for a size past the last allocated block of the file and if that is found, shortens the file to reference the last allocated block thus avoiding having it reference a hole at its end. Submitted by: Chuck Silvers <chs@netflix.com> Tested by: Chuck Silvers <chs@netflix.com> MFC after: 1 week Sponsored by: Netflix
# cc426dd3	11-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Remove unused argument to priv_check_cred. Patch mostly generated with cocinnelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation
# bdd6b77e	05-Dec-2018	Kirk McKusick <mckusick@FreeBSD.org>	If the vfs.ffs.dotrimcons sysctl option is enabled while a file deletion is active, specifically after a call to ffs_blkrelease_start() but before the call to ffs_blkrelease_finish(), ffs_blkrelease_start() will have handed out SINGLETON_KEY rather than starting a collection sequence. Thus if we get a SINGLETON_KEY passed to ffs_blkrelease_finish(), we just return rather than trying to finish the nonexistent sequence. Reported by: Warner Losh (imp@) Sponsored by: Netflix
# 4f77f488	25-Oct-2018	Konstantin Belousov <kib@FreeBSD.org>	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547
# 4b6a2c49	06-Sep-2018	Kirk McKusick <mckusick@FreeBSD.org>	The Call For Testing had no reports of operational problems and found that performance was no worse and usually better when running with TRIM consolidation. Performance improvement was most noticable when multiple large files are released in a short period of time. Thus, TRIM consolidation is being enabled by default. Should operational problems be found, it can be disabled using the command `sysctl vfs.ffs.dotrimcons=0'. This variable can also be set as a tunable if early disabling is necessary. Approved by: re (gjb) Sponsored by: Netflix
# a9c2220f	23-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	Proper spelling of consolidation. Submitted by: Dimitry Andric
# d530fd48	20-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	TRIM consolodation is supposed to be off by default
# 4de0d16b	19-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	For traditional disks, the filesystem attempts to allocate the blocks of a file as contiguously as possible. Since the filesystem does not know how large a file will grow when it is first being written, it initially places the file in a set of blocks in which it currently fits. As it grows, it is relocated to areas with larger contiguous blocks. In this way it saves its large contiguous sets of blocks for the files that need them and thus avoids unnecessaily fragmenting its disk space. We used to skip reallocating the blocks of a file into a contiguous sequence if the underlying flash device requested BIO_DELETE notifications, because devices that benefit from BIO_DELETE also benefit from not moving the data. However, in the algorithm described above that reallocates the blocks, the destination for the data is usually moved before the data is written to the initially allocated location. So we rarely suffer the penalty of extra writes. With the addition of the consolodation of contiguous blocks into single BIO_DELETE operations, having fewer but larger contiguous blocks reduces the number of (slow and expensive) BIO_DELETE operations. So when doing BIO_DELETE consolodation, we do block reallocation. Reviewed by: kib Tested by: Peter Holm Sponsored by: Netflix
# fc6e1715	19-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	Add consolodation of TRIM / BIO_DELETE commands to the UFS/FFS filesystem. When deleting files on filesystems that are stored on flash-memory (solid-state) disk drives, the filesystem notifies the underlying disk of the blocks that it is no longer using. The notification allows the drive to avoid saving these blocks when it needs to flash (zero out) one of its flash pages. These notifications of no-longer-being-used blocks are referred to as TRIM notifications. In FreeBSD these TRIM notifications are sent from the filesystem to the drive using the BIO_DELETE command. Until now, the filesystem would send a separate message to the drive for each block of the file that was deleted. Each Gigabyte of file size resulted in over 3000 TRIM messages being sent to the drive. This burst of messages can overwhelm the drive's task queue causing multiple second delays for read and write requests. This implementation collects runs of contiguous blocks in the file and then consolodates them into a single BIO_DELETE command to the drive. The BIO_DELETE command describes the run of blocks as a single large block being deleted. Each Gigabyte of file size can result in as few as two BIO_DELETE commands and is typically less than ten. Though these larger BIO_DELETE commands take longer to run, they do not clog the drive task queue, so read and write commands can intersperse effectively with them. Though this new feature has been throughly reviewed and tested, it is being added disabled by default so as to minimize the possibility of disrupting the upcoming 12.0 release. It can be enabled by running ``sysctl vfs.ffs.dotrimcons=1''. Users are encouraged to test it. If no problems arise, we will consider requesting that it be enabled by default for 12.0. Reviewed by: kib Tested by: Peter Holm Sponsored by: Netflix
# 7e038bc2	18-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	Replace the TRIM consolodation framework originally added in -r337396 driven by problems found with the algorithms being tested for TRIM consolodation. Reported by: Peter Holm Suggested by: kib Reviewed by: kib Sponsored by: Netflix
# cc91864c	18-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	Revert -r337396. It is being replaced with a revised interface that resulted from testing and further reviews.
# 68c49bcc	06-Aug-2018	Kirk McKusick <mckusick@FreeBSD.org>	Put in place the framework for consolodating contiguous blocks into a smaller number of larger TRIM requests. The hope had been to have the full TRIM consolodation in place for 12.0, but the algorithms are still under development and need further testing. With this framework in place it will be possible to easily add TRIM consolodation once the optimal strategy has been found. The only functional change with this patch is the elimination of TRIM requests for blocks that are freed before they have been likely to have been written. Reviewed by: kib Discussed with: Warner Losh and Chuck Silvers Sponsored by: Netflix
# ab0bcb60	29-Jun-2018	Kirk McKusick <mckusick@FreeBSD.org>	Create um_flags in the ufsmount structure to hold flags for a UFS filesystem. Convert integer structure flags to use um_flags: int um_candelete; /* devvp supports TRIM / int um_writesuspended; / suspension in progress / become: #define UM_CANDELETE 0x00000001 / devvp supports TRIM / #define UM_WRITESUSPENDED 0x00000002 / suspension in progress */ This is in preparation for adding other flags to indicate forcible unmount in progress after a disk failure and possibly forcible downgrade to read-only. No functional change intended. Sponsored by: Netflix
# 06753bd3	25-Jun-2018	Warner Losh <imp@FreeBSD.org>	Use buf + strategy rather than bypassing geom_vfs layer The reference counting that's done in the geom_vfs layer to prevent delivery of requests to defunct devices only works if all requests go through that layer. UFS was bypassing that layer for BIO_DELETE requests, sending them to the geom_consumer directly with g_io_request. Allocate a buf, fill it in, and call strategy on it instead. Submitted by: Chuck Silvers Reviewed by: scottl, imp, kirk Sponsored by: Netflix Differential: https://reviews.freebsd.org/D15456
# 1ef3a74e	19-May-2018	Matt Macy <mmacy@FreeBSD.org>	ufs: remove cgbno variable where unused
# d8ba45e2	16-Mar-2018	Ed Maste <emaste@FreeBSD.org>	Revert r313780 (UFS_ prefix)
# 1e2b9afc	16-Mar-2018	Ed Maste <emaste@FreeBSD.org>	Prefix UFS symbols with UFS_ to reduce namespace pollution Followup to r313780. Also prefix ext2's and nandfs's versions with EXT2_ and NANDFS_. Reported by: kib Reviewed by: kib, mckusick Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9623
# 47806d1b	05-Feb-2018	Kirk McKusick <mckusick@FreeBSD.org>	Occasional cylinder-group check-hash errors were being reported on systems running with a heavy filesystem load. Tracking down this bug was elusive because there were actually two problems. Sometimes the in-memory check hash was wrong and sometimes the check hash computed when doing the read was wrong. The occurrence of either error caused a check-hash mismatch to be reported. The first error was that the check hash in the in-memory cylinder group was incorrect. This error was caused by the following sequence of events: - We read a cylinder-group buffer and the check hash is valid. - We update its cg_time and cg_old_time which makes the in-memory check-hash value invalid but we do not mark the cylinder group dirty. - We do not make any other changes to the cylinder group, so we never mark it dirty, thus do not write it out, and hence never update the incorrect check hash for the in-memory buffer. - Later, the buffer gets freed, but the page with the old incorrect check hash is still in the VM cache. - Later, we read the cylinder group again, and the first page with the old check hash is still in the VM cache, but some other pages are not, so we have to do a read. - The read does not actually get the first page from disk, but rather from the VM cache, resulting in the old check hash in the buffer. - The value computed after doing the read does not match causing the error to be printed. The fix for this problem is to only set cg_time and cg_old_time as the cylinder group is being written to disk. This keeps the in-memory check-hash valid unless the cylinder group has had other modifications which will require it to be written with a new check hash calculated. It also requires that the check hash be recalculated in the in-memory cylinder group when it is marked clean after doing a background write. The second problem was that the check hash computed at the end of the read was incorrect because the calculation of the check hash on completion of the read was being done too soon. - When a read completes we had the following sequence: - bufdone() -- b_ckhashcalc (calculates check hash) -- bufdone_finish() --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) - When we are reading a buffer where one or more pages are already in memory (but not all pages, or we wouldn't be doing the read), the I/O is done with bogus_page mapped in for the pages that exist in the VM cache. This mapping is done to avoid corrupting the cached pages if there is any I/O overrun. The vfs_vmio_iodone() function is responsible for replacing the bogus_page(s) with the cached ones. But we were calculating the check hash before the bogus_page(s) were replaced. Hence, when we were calculating the check hash, we were partly reading from bogus_page, which means we calculated a bad check hash (e.g., because multiple pages have been mapped to bogus_page, so its contents are indeterminate). The second fix is to move the check-hash calculation from bufdone() to bufdone_finish() after the call to vfs_vmio_iodone() so that it computes the check hash over the correct set of pages. With these two changes, the occasional cylinder-group check-hash errors are gone. Submitted by: David Pfitzner <dpfitzner@netflix.com> Reviewed by: kib Tested by: David Pfitzner
# 0d37a428	31-Jan-2018	Kirk McKusick <mckusick@FreeBSD.org>	When reading a cylinder group, break out reporting of check hash errors from other types of errors so that the error is correctly reported.
# 51268c38	27-Dec-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	SPDX: Complete License ID tags for UFS.
# 151ba793	24-Dec-2017	Alexander Kabaev <kan@FreeBSD.org>	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385
# b6fbf003	09-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Provide a sysctl to force synchronous initialization of inode blocks. FFS performs asynchronous inode initialization, using a barrier write to ensure that the inode block is written before the corresponding cylinder group header update. Some GEOMs do not appear to handle BIO_ORDERED correctly, meaning that the barrier write may not work as intended. The sysctl allows one to work around this problem at the cost of expensive file creation on new filesystems. The default behaviour is unchanged. Reviewed by: kib, mckusick MFC after: 1 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13428
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 4e13cca5	05-Nov-2017	Konstantin Belousov <kib@FreeBSD.org>	Improve the message printed when the cylinder group checksum is wrong. Mention the device path and mount point path, handle snapshots. Tested by: imp Sponsored by: The FreeBSD Foundation
# 55e5a5c1	22-Sep-2017	Konstantin Belousov <kib@FreeBSD.org>	Fix 32bit build. Reported by: emaste Sponsored by: The FreeBSD Foundation
# 75e3597a	21-Sep-2017	Kirk McKusick <mckusick@FreeBSD.org>	Continuing efforts to provide hardening of FFS, this change adds a check hash to cylinder groups. If a check hash fails when a cylinder group is read, no further allocations are attempted in that cylinder group until it has been fixed by fsck. This avoids a class of filesystem panics related to corrupted cylinder group maps. The hash is done using crc32c. Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily used in embedded systems with small memories and low-powered processors which need as light-weight a filesystem as possible. Specifics of the changes: sys/sys/buf.h: Add BX_FSPRIV to reserve a set of eight b_xflags that may be used by individual filesystems for their own purpose. Their specific definitions are found in the header files for each filesystem that uses them. Also add fields to struct buf as noted below. sys/kern/vfs_bio.c: It is only necessary to compute a check hash for a cylinder group when it is actually read from disk. When calling bread, you do not know whether the buffer was found in the cache or read. So a new flag (GB_CKHASH) and a pointer to a function to perform the hash has been added to breadn_flags to say that the function should be called to calculate a hash if the data has been read. The check hash is placed in b_ckhash and the B_CKHASH flag is set to indicate that a read was done and a check hash calculated. Though a rather elaborate mechanism, it should also work for check hashing other metadata in the future. A kernel internal API change was to change breada into a static fucntion and add flags and a function pointer to a check-hash function. sys/ufs/ffs/fs.h: Add flags for types of check hashes; stored in a new word in the superblock. Define corresponding BX_ flags for the different types of check hashes. Add a check hash word in the cylinder group. sys/ufs/ffs/ffs_alloc.c: In ffs_getcg do the dance with breadn_flags to get a check hash and if one is provided, check it. sys/ufs/ffs/ffs_vfsops.c: Copy across the BX_FFSTYPES flags in background writes. Update the check hash when writing out buffers that need them. sys/ufs/ffs/ffs_snapshot.c: Recompute check hash when updating snapshot cylinder groups. sys/libkern/crc32.c: lib/libufs/Makefile: lib/libufs/libufs.h: lib/libufs/cgroup.c: Include libkern/crc32.c in libufs and use it to compute check hashes when updating cylinder groups. Four utilities are affected: sbin/newfs/mkfs.c: Add the check hashes when building the cylinder groups. sbin/fsck_ffs/fsck.h: sbin/fsck_ffs/fsutil.c: Verify and update check hashes when checking and writing cylinder groups. sbin/fsck_ffs/pass5.c: Offer to add check hashes to existing filesystems. Precompute check hashes when rebuilding cylinder group (although this will be done when it is written in fsutil.c it is necessary to do it early before comparing with the old cylinder group) sbin/dumpfs/dumpfs.c Print out the new check hash flag(s) sbin/fsdb/Makefile: Needs to add libufs now used by pass5.c imported from fsck_ffs. Reviewed by: kib Tested by: Peter Holm (pho)
# 35bf7809	16-Jul-2017	Konstantin Belousov <kib@FreeBSD.org>	Remove write-only variable. Tested by: pho Sponsored by: The FreeBSD Foundation
# 51a6a15f	16-Jul-2017	Konstantin Belousov <kib@FreeBSD.org>	A followup to r320453, correct removal of the blocks from UFS snapshots. Tested by: pho PR: 220693 Sponsored by: The FreeBSD Foundation
# 9c4f551e	28-Jun-2017	Kirk McKusick <mckusick@FreeBSD.org>	Create a new function ffs_getcg() to read in and verify a cylinder group. Change all code points that open-coded this functionality to use the new function. This commit is a refactoring with no change in functionality. In the future this change allows more robust checking of cylinder group reads along the lines discussed in the hardening UFS session at BSDCan (retry I/O, add checksums, etc). For more detail see the session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS Reviewed by: kib
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 1dc349ab	15-Feb-2017	Ed Maste <emaste@FreeBSD.org>	prefix UFS symbols with UFS_ to reduce namespace pollution Specifically: ROOTINO -> UFS_ROOTINO WINO -> UFS_WINO NXADDR -> UFS_NXADDR NDADDR -> UFS_NDADDR NIADDR -> UFS_NIADDR MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency) Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_ Reviewed by: kib, mckusick Obtained from: NetBSD MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9536
# bf9c87c8	19-Sep-2016	Konstantin Belousov <kib@FreeBSD.org>	Be more strict when selecting between snapshot/regular mount. Reclaimed vnode type is VBAD, so succesful comparision like devvp->v_type != VREG does not imply that the devvp references snapshot, it might be due to a reclaimed vnode. Explicitely check the vnode type. In the the most important case of ffs_blkfree(), the devfs vnode is locked and its type is stable. In other cases, if the vnode is reclaimed right after the check, hopefully the buffer methods return right error codes. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# e1db6897	17-Sep-2016	Konstantin Belousov <kib@FreeBSD.org>	Reduce size of ufs inode. Remove redunand i_dev and i_fs pointers, which are available as ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to reduce padding. To compensate added derefences, the most often i_ump access to differentiate between UFS1 and UFS2 dinode layout is removed, by addition of the new i_flag IN_UFS2. Overall, this actually reduces the amount of memory dereferences. On 64bit machine, original struct inode size is 176, reduced to 152 bytes with the change. Tested by: pho (previous version) Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 57d2ac2f	22-May-2016	Kevin Lo <kevlo@FreeBSD.org>	arc4random() returns 0 to (2**32)−1, use an alternative to initialize i_gen if it's zero rather than a divide by 2. With inputs from delphij, mckusick, rmacklem Reviewed by: mckusick
# 08e94183	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	UFS: spelling fixes on comments. No functional change.
# abafa4db	10-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	ufs: replace 0 with NULL for pointers. While here also do late initialization of the variables we are changing. Found with devel/coccinelle. Reviewed by: mckusick MFC after: 2 weeks
# c79dff0f	27-Mar-2016	Konstantin Belousov <kib@FreeBSD.org>	Split the global taskqueue used to process all UFS trim completions, into per-mount taskqueue with the private taskqueue processing thread. This allows to drain the taskqueue on unmount, to ensure that all TRIMs are finished before mount structures are freed. But just draining the taskqueue where TRIM biodone geom-up completions are processed is not enough, since ffs_blkfree(), called by the task, might result in more writes. Count inflight delayed blkfree's and pause() unmount until the counter drains as well. Reported by: Nick Evans <nevans@talkpoint.com> Tested by: Nick Evans <nevans@talkpoint.com>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 476dfdc3	15-Oct-2015	Warner Losh <imp@FreeBSD.org>	Do not relocate extents to make them contiguous if the underlying drive can do deletions. Ability to do deletions is a strong indication that this optimization will not help performance. It will only generate extra write traffic. These devices are typically flash based and have a limited number of write cycles. In addition, making the file contiguous in LBA space doesn't improve the access times from flash devices because they have no seek time. Reviewed by: mckusick@
# f0725a8e	11-Jul-2015	Mateusz Guzik <mjg@FreeBSD.org>	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib
# 4da8456f	16-Jun-2015	Mateusz Guzik <mjg@FreeBSD.org>	Replace struct filedesc argument in getvnode with struct thread This is is a step towards removal of spurious arguments.
# 74a87c38	24-Apr-2015	Kirk McKusick <mckusick@FreeBSD.org>	Limit the number of cylinder groups that will be searched when trying to build a cluster. The limit is tunable using the sysctl vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups per block allocation. It was previously limited to the number of cylinder groups in the filesystem per block allocation. When there were no clusters of the needed size left, it repeatedly searched the whole filesystem for a non-existent cluster on every block allocation. The result was very slow filesystem allocation with 100% CPU utilization. The old behavior can be had by setting vfs.ffs.maxclustersearch to a huge number (1,000,000). This change affects only the layout policy routines so is not able to interfere with the integrity of the filesystem. Reported by: Dmitry Sivachenko (demon@) Tested by: Dmitry Sivachenko (demon@) MFC after: 2 weeks
# dde58752	17-Dec-2014	Gleb Kurtsou <gleb@FreeBSD.org>	Adjust printf format specifiers for dev_t and ino_t in kernel. ino_t and dev_t are about to become uint64_t. Reviewed by: kib, mckusick
# e24784d3	22-Mar-2014	Kirk McKusick <mckusick@FreeBSD.org>	Update comment to explain search order reverted to historical order in -r254996. Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days
# 4a144410	16-Mar-2014	Robert Watson <rwatson@FreeBSD.org>	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks
# 4896af9f	01-Mar-2014	Pedro F. Giffuni <pfg@FreeBSD.org>	ufs: small formatting fixes. Cleanup some extra space. Use of tabs vs. spaces. No functional change. MFC after: 3 days Reviewed by: mckusick
# be1fd823	30-Dec-2013	Kirk McKusick <mckusick@FreeBSD.org>	Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks
# 7008be5b	04-Sep-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation
# 2ce451f0	28-Aug-2013	Kirk McKusick <mckusick@FreeBSD.org>	In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), determine that old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order. MFC after: 2 weeks
# 28702816	28-Aug-2013	Kirk McKusick <mckusick@FreeBSD.org>	A performance problem was reported in PR kern/181226: I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally to gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782. The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas. The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks
# fdd9a6f4	14-Jul-2013	Kirk McKusick <mckusick@FreeBSD.org>	Update to comments describing block allocation policy. Submitted by: Bruce Evans
# 3f8db5b1	02-Jul-2013	Kirk McKusick <mckusick@FreeBSD.org>	Make better use of metadata area by avoiding using it for data blocks that no should no longer immediately follow their indirect blocks. MFC after: 2 weeks
# baa12a84	22-Mar-2013	Kirk McKusick <mckusick@FreeBSD.org>	The purpose of this change to the FFS layout policy is to reduce the running time for a full fsck. It also reduces the random access time for large files and speeds the traversal time for directory tree walks. The key idea is to reserve a small area in each cylinder group immediately following the inode blocks for the use of metadata, specifically indirect blocks and directory contents. The new policy is to preferentially place metadata in the metadata area and everything else in the blocks that follow the metadata area. The size of this area can be set when creating a filesystem using newfs(8) or changed in an existing filesystem using tunefs(8). Both utilities use the `-k held-for-metadata-blocks' option to specify the amount of space to be held for metadata blocks in each cylinder group. By default, newfs(8) sets this area to half of minfree (typically 4% of the data area). This work was inspired by a paper presented at Usenix's FAST '13: www.usenix.org/conference/fast13/ffsck-fast-file-system-checker Details of this implementation appears in the April 2013 of ;login: www.usenix.org/publications/login/april-2013-volume-38-number-2. A copy of the April 2013 ;login: paper can also be downloaded from: www.mckusick.com/publications/faster_fsck.pdf. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks
# 59a01b70	19-Mar-2013	Konstantin Belousov <kib@FreeBSD.org>	UFS support of the unmapped i/o for the user data buffers. Sponsored by: The FreeBSD Foundation Tested by: pho, scottl, jhb, bf
# 94f4ac21	27-Feb-2013	Konstantin Belousov <kib@FreeBSD.org>	An inode block must not be blockingly read while cg block is owned. The order is inode buffer lock -> snaplk -> cg buffer lock, reversing the order causes deadlocks. Inode block must not be written while cg block buffer is owned. The FFS copy on write needs to allocate a block to copy the content of the inode block, and the cylinder group selected for the allocation might be the same as the owned cg block. The reserved block detection code in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the situation, because the locked cg buffer is not exposed to it. In order to maintain the dependency between initialized inode block and the cg_initediblk pointer, look up the inode buffer in non-blocking mode. If succeeded, brelse cg block, initialize the inode block and write it. After the write is finished, reread cg block and update the cg_initediblk. If inode block is already locked by another thread, let the another thread initialize it. If another thread raced with us after we started writing inode block, the situation is detected by an update of cg_initediblk. Note that double-initialization of the inode block is harmless, the block cannot be used until cg_initediblk is incremented. Sponsored by: The FreeBSD Foundation In collaboration with: pho Reviewed by: mckusick MFC after: 1 month X-MFC-note: after r246877
# 7839b23f	16-Feb-2013	Kirk McKusick <mckusick@FreeBSD.org>	The UFS2 filesystem allocates new blocks of inodes as they are needed. When a cylinder group runs short of inodes, a new block for inodes is allocated, zero'ed, and written to the disk. The zero'ed inodes must be on the disk before the cylinder group can be updated to claim them. If the cylinder group claiming the new inodes were written before the zero'ed block of inodes, the system could crash with the filesystem in an unrecoverable state. Rather than adding a soft updates dependency to ensure that the new inode block is written before it is claimed by the cylinder group map, we just do a barrier write of the zero'ed inode block to ensure that it will get written before the updated cylinder group map can be written. This change should only slow down bulk loading of newly created filesystems since that is the primary time that new inode blocks need to be created. Reported by: Robert Watson Reviewed by: kib Tested by: Peter Holm
# 9604a7f1	10-Feb-2013	Konstantin Belousov <kib@FreeBSD.org>	Fix several unsafe pointer dereferences in the buffered_write() function, implementing the sysctl vfs.ffs.set_bufoutput (not used in the tree yet). - The current directory vnode dereference is unsafe since fd_cdir could be changed and unreferenced, lock the filedesc around and vref the fd_cdir. - The VTOI() conversion of the fd_cdir is unsafe without first checking that the vnode is indeed from an FFS mount, otherwise the code dereferences a random memory. - The cdir could be reclaimed from under us, lock it around the checks. - The type of the fp vnode might be not a disk, or it might have changed while the thread was in flight, check the type. Reviewed and tested by: mckusick MFC after: 2 weeks
# aa7ddc85	03-Nov-2012	Kirk McKusick <mckusick@FreeBSD.org>	When a file is first being written, the dynamic block reallocation (implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks so as to cluster them together into a contiguous set of blocks on the disk. When the cluster crosses the boundary into the first indirect block, the first indirect block is initially allocated in a position immediately following the last direct block. Block reallocation would usually destroy locality by moving the indirect block out of the way to keep the data blocks contiguous. This change compensates for this problem by noting that the first indirect block should be left immediately following the last direct block. It then tries to start a new cluster of contiguous blocks (referenced by the indirect block) immediately following the indirect block. We should also do this for other indirect block boundaries, but it is only important for the first one. Suggested by: Bruce Evans MFC: 2 weeks
# 5050aa86	22-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# fc8fdae0	27-Sep-2012	Matthew D Fleming <mdf@FreeBSD.org>	Fix up kernel sources to be ready for a 64-bit ino_t. Original code by: Gleb Kurtsou
# c5c1199c	02-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks
# 8f8d3027	01-Jan-2012	Ed Schouten <ed@FreeBSD.org>	Migrate ufs and ext2fs from skpc() to memcchr(). While there, remove a useless check from the code. memcchr() always returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff can never be equal to zero. Also, the fact that memcchr() returns a pointer instead of the number of bytes until the end, makes conversion to an offset far more easy.
# 4b3a6fb9	15-Aug-2011	Robert Watson <rwatson@FreeBSD.org>	Fix two cases involving opt_capsicum.h and module builds: (1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the #include. (2) portalfs depends on opt_capsicum.h, so have the Makefile generate one if required. These affect only modules built without a kernel (i.e, not buildkernel, but yes buildworld if the dubious MODULES_WITH_WORLD is used). Approved by: re (bz) Sponsored by: Google Inc
# a9d2f8d8	10-Aug-2011	Robert Watson <rwatson@FreeBSD.org>	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
# fddf7bae	29-Jul-2011	Kirk McKusick <mckusick@FreeBSD.org>	Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP is set so that mount can revert back to using MNT_NOWAIT when doing getmntinfo. Approved by: re (kib)
# 2621f2c4	22-Jul-2011	Kirk McKusick <mckusick@FreeBSD.org>	Default debugging error messages to off for journaled soft updates sysctls. Delete limiting on output of these sysctls. Approved by: re (kib)
# 927a12ae	15-Jul-2011	Kirk McKusick <mckusick@FreeBSD.org>	Add an FFS specific mount option to allow a filesystem checker (typically fsck_ffs) to register that it wishes to use FFS specific sysctl's to update the filesystem. This ensures that two checkers cannot run on a given filesystem at the same time and that no other process accidentally or maliciously uses the filesystem updating sysctls inappropriately. This functionality is needed by the journaling soft-updates recovery code.
# 17ff0cf7	09-Jul-2011	Kirk McKusick <mckusick@FreeBSD.org>	When first creating snapshots, we may free some blocks within it. These blocks should not have TRIM applied to them. Submitted by: Kostik Belousov
# 16f7d822	19-Jun-2011	Jeff Roberson <jeff@FreeBSD.org>	- Fix directory count rollbacks by passing the mode to the journal dep earlier. - Add rollback/forward code for frag and cluster accounting. - Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)
# 43a3cc77	15-Jun-2011	Kirk McKusick <mckusick@FreeBSD.org>	Ensure that filesystem metadata contained within persistent snapshots is always kept consistent. Suggested by: Jeff Roberson
# 2191e465	15-Jun-2011	Kirk McKusick <mckusick@FreeBSD.org>	With the restructuring of the block reclaimation code, the notification messages for a filesystem being out of space need to be moved so that they do not print out until after a failed cleanup attempt. Suggested by: Jeff Roberson
# 9eb8728a	12-Jun-2011	Kirk McKusick <mckusick@FreeBSD.org>	Update to soft updates journaling to properly track freed blocks that get claimed by snapshots. Submitted by: Jeff Roberson Tested by: Peter Holm
# 280e091a	10-Jun-2011	Jeff Roberson <jeff@FreeBSD.org>	Implement fully asynchronous partial truncation with softupdates journaling to resolve errors which can cause corruption on recovery with the old synchronous mechanism. - Append partial truncation freework structures to indirdeps while truncation is proceeding. These prevent new block pointers from becoming valid until truncation completes and serialize truncations. - On completion of a partial truncate journal work waits for zeroed pointers to hit indirects. - softdep_journal_freeblocks() handles last frag allocation and last block zeroing. - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it is only implemented in one place. - Block allocation failure handling moved up one level so it does not proceed with buf locks held. This permits us to do more extensive reclaims when filesystem space is exhausted. - softdep_sync_metadata() is broken into two parts, the first executes once at the start of ffs_syncvnode() and flushes truncations and inode dependencies. The second is called on each locked buf. This eliminates excessive looping and rollbacks. - Improve the mechanism in process_worklist_item() that handles acquiring vnode locks for handle_workitem_remove() so that it works more generally and does not loop excessively over the same worklist items on each call. - Don't corrupt directories by zeroing the tail in fsck. This is only done for regular files. - Push a fsync complete record for files that need it so the checker knows a truncation in the journal is no longer valid. Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts) Tested by: pho
# 9f62b10c	05-Jun-2011	Kirk McKusick <mckusick@FreeBSD.org>	Grammer fix in comment. Eliminate one (of several) possible conflicting buffer locks when trying to reclaim blocks. Rest of fix to be incorporated as part of SUJ update by jeff. Pointed out by: Kostik Belousov
# 1508294b	28-May-2011	Kirk McKusick <mckusick@FreeBSD.org>	Due to a lag in updating the fs_pendinginodes count, we cannot depend on it to decide whether we should try to reclaim inodes when we run short. Discovered by: Peter Holm
# 99f6ac66	26-May-2011	Kirk McKusick <mckusick@FreeBSD.org>	The check for whether a block is going to be claimed by a snapshot needs to happen before we notify the underlying layer that it is being freed.
# d9ca1af7	24-Apr-2011	Konstantin Belousov <kib@FreeBSD.org>	VFS sometimes is unable to inactivate a vnode when vnode use count goes to zero. E.g., the vnode might be only shared-locked at the time of vput() call. Such vnodes are kept in the hash, so they can be found later. If ffs_valloc() allocated an inode that has its vnode cached in hash, and still owing the inactivation, then vget() call from ffs_valloc() clears VI_OWEINACT, and then the vnode is reused for the newly allocated inode. The problem is, the vnode is not reclaimed before it is put to the new use. ffs_valloc() recycles vnode vm object, but this is not enough. In particular, at least v_vflag should be cleared, and several bits of UFS state need to be removed. It is very inconvenient to call vgone() at this point. Instead, move some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(), and call the helper from VOP_RECLAIM and ffs_valloc(). Reviewed by: mckusick Tested by: pho MFC after: 3 weeks
# 4c821a39	05-Apr-2011	Kirk McKusick <mckusick@FreeBSD.org>	Be far more persistent in reclaiming blocks and inodes before giving up and declaring a filesystem out of space. Especially necessary when running on a small filesystem. With this improvement, it should be possible to use soft updates on a small root filesystem. Kudos to: Peter Holm Testing by: Peter Holm MFC: 2 weeks
# 0a809056	22-Mar-2011	Kirk McKusick <mckusick@FreeBSD.org>	Add retry code analogous to the block allocation retry code to avoid running out of inodes. Reported by: Peter Holm
# 96e1934a	04-Mar-2011	John Baldwin <jhb@FreeBSD.org>	Use ffs() to locate free bits in the inode bitmap rather than a loop with bit shifts. Reviewed by: mckusick MFC after: 1 month
# 8c2a54de	28-Dec-2010	Konstantin Belousov <kib@FreeBSD.org>	Add kernel side support for BIO_DELETE/TRIM on UFS. The FS_TRIM fs flag indicates that administrator requested issuing of TRIM commands for the volume. UFS will only send the command to disk if the disk reports GEOM::candelete attribute. Since disk queue is reordered, data block is marked as free in the bitmap only after TRIM command completed. Due to need to sleep waiting for i/o to finish, TRIM bio_done routine schedules taskqueue to set the bitmap bit. Based on the patch by: mckusick Reviewed by: mckusick, pjd Tested by: pho MFC after: 1 month
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 9f9c8c59	06-Jul-2010	Jeff Roberson <jeff@FreeBSD.org>	- Handle the truncation of an inode with an effective link count of 0 in the context of the process that reduced the effective count. Previously all truncation as a result of unlink happened in the softdep flush thread. This had the effect of being impossible to rate limit properly with the journal code. Now the process issuing unlinks is suspended when the journal files. This has a side-effect of improving rm performance by allowing more concurrent work. - Handle two cases in inactive, one for effnlink == 0 and another when nlink finally reaches 0. - Eliminate the SPACECOUNTED related code since the truncation is no longer delayed. Discussed with: mckusick
# 113db2dd	24-Apr-2010	Jeff Roberson <jeff@FreeBSD.org>	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
# 4179ce18	26-Feb-2010	Kirk McKusick <mckusick@FreeBSD.org>	MFC of 203763, 203764, 203768, 203769, 203770, 203782, and 203784. These fixes correct a problem in the file system that treats large inode numbers as negative rather than unsigned. For a default (16K block) file system, this bug began to show up at a file system size above about 16Tb. These fixes also update newfs to ensure that it will never create a filesystem with more than 2^32 inodes. They also update libufs, tunefs, and growfs so that they properly handle inode numbers as unsigned. Reported by: Scott Burns, John Kilburg, and Bruce Evans Followup by: Jeff Roberson PR: 133980
# c84124c0	20-Feb-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r203818: Clear the bp pointer when buffer is already brelse()d.
# 2950ff25	13-Feb-2010	Konstantin Belousov <kib@FreeBSD.org>	When ffs_realloccg() failed to allocate bigger fragment and, because pending blocks are scheduled for removal, goes to retry the (re)allocation, clear the bp pointer. It might happen that meantime free space is really exhausted and we are entering nospace: label without bread()ing buffer, causing stale bp value to be brelse()d again. Tested by: pho (Producing a scenario to reliably reproduce the race appeared to be much harder then fixing the bug) MFC after: 1 week
# e870d1e6	10-Feb-2010	Kirk McKusick <mckusick@FreeBSD.org>	This fix corrects a problem in the file system that treats large inode numbers as negative rather than unsigned. For a default (16K block) file system, this bug began to show up at a file system size above about 16Tb. To fully handle this problem, newfs must be updated to ensure that it will never create a filesystem with more than 2^32 inodes. That patch will be forthcoming soon. Reported by: Scott Burns, John Kilburg, Bruce Evans Followup by: Jeff Roberson PR: 133980 MFC after: 2 weeks
# 53298164	11-Jan-2010	Kirk McKusick <mckusick@FreeBSD.org>	Cast 64-bit quantity to intptr_t rather than int so as to work properly with 64-bit architectures (such as amd64). Reported by: bz
# e268f54c	11-Jan-2010	Kirk McKusick <mckusick@FreeBSD.org>	Background: When renaming a directory it passes through several intermediate states. First its new name will be created causing it to have two names (from possibly different parents). Next, if it has different parents, its value of ".." will be changed from pointing to the old parent to pointing to the new parent. Concurrently, its old name will be removed bringing it back into a consistent state. When fsck encounters an extra name for a directory, it offers to remove the "extraneous hard link"; when it finds that the names have been changed but the update to ".." has not happened, it offers to rewrite ".." to point at the correct parent. Both of these changes were considered unexpected so would cause fsck in preen mode or fsck in background mode to fail with the need to run fsck manually to fix these problems. Fsck running in preen mode or background mode now corrects these expected inconsistencies that arise during directory rename. The functionality added with this update is used by fsck running in background mode to make these fixes. Solution: This update adds three new fsck sysctl commands to support background fsck in correcting expected inconsistencies that arise from incomplete directory rename operations. They are: setcwd(dirinode) - set the current directory to dirinode in the filesystem associated with the snapshot. setdotdot(oldvalue, newvalue) - Verify that the inode number for ".." in the current directory is oldvalue then change it to newvalue. unlink(nameptr, oldvalue) - Verify that the inode number associated with nameptr in the current directory is oldvalue then unlink it. As with all other fsck sysctls, these new ones may only be used by processes with appropriate priviledge. Reported by: jeff Security issues: rwatson
# 6e5982ca	17-May-2009	Alan Cox <alc@FreeBSD.org>	Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg(). In collaboration with: tegge
# 8a3f2c37	06-Feb-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	When a device containing mounted UFS filesystem disappears, the type of devvp becomes VBAD, which UFS incorrectly interprets as snapshot vnode, which in turns causes panic. Fix it by replacing '!= VCHR' with '== VREG'. With this fix in place, you should no longer be able to panic the system by removing a device with an UFS filesystem mounted from it - assuming you don't use softupdates. Reviewed by: kib Tested by: pho Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
# ec7e66e8	27-Jan-2009	Robert Watson <rwatson@FreeBSD.org>	Following a fair amount of real world experience with ACLs and extended attributes since FreeBSD 5, make the following semantic changes: - Don't update the inode modification time (mtime) when extended attributes (and hence also ACLs) are added, modified, or removed. - Don't update the inode access tie (atime) when extended attributes (and hence also ACLs) are queried. This means that rsync (and related tools) won't improperly think that the data in the file has changed when only the ACL has changed. Note that ffs_reallocblks() has not been changed to not update on an IO_EXT transaction, but currently EAs don't use the cluster write routines so this shouldn't be a problem. If EAs grow support for clustering, then VOP_REALLOCBLKS() will need to grow a flag argument to carry down IO_EXT to UFS. MFC after: 1 week PR: ports/125739 Reported by: Alexander Zagrebin <alexz@visp.ru> Tested by: pluknet <pluknet@gmail.com>, Greg Byshenk <freebsd@byshenk.net> Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# acd05e46	28-Aug-2008	Konstantin Belousov <kib@FreeBSD.org>	In ffs_valloc(), ffs_vget() may fail because insmntque() refused to insert new vnode into the mount vnode list. Then, for the SU-enabled mount, ffs_vfree could create freefile dependency. This dependency can hang around forever since inode is not marked as IN_MODIFIED and correspondingly inodeblock may be not marked as dirty. After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as modified, and vput() it immediately. Take care of the dup alloc. Tested by: pho Reviewed by: tegge MFC after: 1 month
# d9e6294e	01-Dec-2007	Ken Smith <kensmith@FreeBSD.org>	Fix a broken check that recently became more annoying because it now gets enabled when INVARIANTS is on instead of DIAGNOSTIC (which apparently nobody uses). From Tor's description: This happens when the block range spans two block maps, the first in the inode (mapping up to NDADDR direct blocks) and the second being the first indirect block. The current check assumes that both block maps are indirect blocks. Work done by: tegge Tested by: kris, kensmith
# 1102b89b	08-Nov-2007	David E. O'Brien <obrien@FreeBSD.org>	Turn most ffs 'DIAGNOSTIC's into INVARIANTS.
# 7fd627f0	10-Sep-2007	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a DIV0 in case a large value for fs_avgfilesize or fs_avgfpdir is given (with newfs or tunefs) and dirsize overflows. In case dirsize is <= 0 because of an overflow set maxcontigdirs to 0 so it will be 1 later. This is what would happen for large fs_avgfilesize. [1] Identified with help from: roberto, pjd Submitted by: pjd [1] Approved by: re (rwatson) MFC after: 8 days
# 32f9753c	11-Jun-2007	Robert Watson <rwatson@FreeBSD.org>	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
# 98fff6b5	23-Feb-2007	Brian Somers <brian@FreeBSD.org>	Account for di_blocks allocations when IN_SPACECOUNTED is set in an inode's i_flag. It's possible that after ufs_infactive() calls softdep_releasefile(), i_nlink stays >0 for a considerable amount of time (> 60 seconds here). During this period, any ffs allocation routines that alter di_blocks must also account for the blocks in the filesystem's fs_pendingblocks value. This change fixes an eventual df/du discrepency that will happen as the result of fs_pendingblocks being reduced to <0. The only manifestation of this that people may recognise is the following message on boot: /somefs: update error: blocks -N files M at which point the negative pending block count is adjusted to zero. Reviewed by: tegge MFC after: 3 weeks
# db9b81ea	20-Jan-2007	Mike Pritchard <mpp@FreeBSD.org>	Quota system cleanup. 1) Do not do quota accounting for the actual quota data files or for file system snapshot files ("system" files). This prevents a deadlock descibed in PR kern/30958 if the kernel ever has to grow the quota file. Snapshot files were already exempt from the quota checks, but this change generalized the check. 2) Fix a cast that caused extremely large uids/gids to incorrectly write the quota information to the data file at a truncated value for a uint_t32 id value. The incorrect cast caused quota files in this case to be around 4GB in size, with the correct cast they can now be 131GB in size. Also related to PR kern/30958. 3) Check for what appear to be negative UIDs/GIDs and not account for them. This prevents the quota files from becoming 131GB in size and causing quotacheck to run forever at bootup. This could also cause the kernel to try and expand the quota file, which might deadlock due to the issue in #1. kern/30958 and kern/38156 (and some much older closed PR's). 4) With the deadlock problems gone, the kernel can now expand the size of the quota database files if it needs to. 5) Pass in the i-node count change value to chkiq and chkiqchg as an int, like it used to be before the common routine was split up into 2 different routines to increase / decrease the i-node in-use count. Prevents an underflow on the i-node count. Related to PR kern/89247. 6) Prevent the block usage from growing slowly if a file system is full and the write was denied due to that fact. PR kern/89247. Some of these changes require an updated quotacheck to prevent the creation of huge (131GB) quota data files (item #3). #1/#4 probably fixes a lot of the random hangs when quotas are enabled, possibly some of the jail hangs.
# 6192525b	16-Jan-2007	Mike Pritchard <mpp@FreeBSD.org>	Fix a spelling error in some comments. heirarchy -> hierarchy. Obtained from: OpenBSD
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# 2b8c9fa4	18-Jul-2006	Stefan Farfeleder <stefanf@FreeBSD.org>	Drop two unnecessary casts.
# eb2ea105	01-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Move softdep from using a global worklist to per-mount worklists. This has many positive effects including improved smp locking, reducing interdependencies between mounts that can lead to deadlocks, etc. - Add the softdep worklist and various counters to the ufsmnt structure. - Add a mount pointer to the workitem and remove mount pointers from the various structures derived from the workitem as they are now redundant. - Remove the poor-man's semaphore protecting softdep_process_worklist and softdep_flushworklist. Several threads may now process the list simultaneously. - Add softdep_waitidle() to block the thread until all pending dependencies being operated on by other threads have been flushed. - Use softdep_waitidle() in unmount and snapshots to block either operation until the fs is stable. - Remove softdep worklist processing from the syncer and move it into the softdep_flush() thread. This thread processes all softdep mounts once each second and when it is called via the new softdep_speedup() when there is a resource shortage. This removes the softdep hook from the kernel and various hacks in header files to support it. Reviewed by/Discussed with: tegge, truckman, mckusick Tested by: kris
# e1cef627	31-Oct-2005	Paul Saab <ps@FreeBSD.org>	Rate limit filesystem full and out of inodes messages to once a second.
# 48c2ac45	10-Oct-2005	Tor Egge <tegge@FreeBSD.org>	Avoid unintended VMIO on directories and symlinks due to leftover object not having been destroyed.
# c73e9e9c	09-Oct-2005	Tor Egge <tegge@FreeBSD.org>	Reinitialize v_type and v_op fields in case vnode has been reused without reclamation. If the vnode previously was a fifo then v_op would point to ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at the end of ffsext_strategy.
# 448434c3	03-Oct-2005	Don Lewis <truckman@FreeBSD.org>	Initialize the inode i_flag field in ffs_valloc() to clean up any stale flag bits left over from before the inode was recycled. Without this change, a leftover IN_SPACECOUNTED flag could prevent softdep_freefile() and softdep_releasefile() from incrementing fs_pendinginodes. Because handle_workitem_freefile() unconditionally decrements fs_pendinginodes, a negative value could be reported at file system unmount time with a message like: unmount pending error: blocks 0 files -3 The pending block count in fs_pendingblocks could also be negative for similar reasons. These errors can cause the data returned by statfs() to be slightly incorrect. Some other cleanup code in softdep_releasefile() could also be incorrectly bypassed. MFC after: 3 days
# 5f419982	28-Sep-2005	Robert Watson <rwatson@FreeBSD.org>	Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57, osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60, svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81, svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55, svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10, ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58, unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133: Now that Giant is acquired in uprintf() and tprintf(), the caller no longer leads to acquire Giant unless it also holds another mutex that would generate a lock order reversal when calling into these functions. Specifically not backed out is the acquisition of Giant in nfs_socket.c and rpcclnt.c, where local mutexes are held and would otherwise violate the lock order with Giant. This aligns this code more with the eventual locking of ttys. Suggested by: bde
# 84d2b7df	19-Sep-2005	Robert Watson <rwatson@FreeBSD.org>	Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(), as they both interact with the tty code (!MPSAFE) and may sleep if the tty buffer is full (per comment). Modify all consumers of uprintf() and tprintf() to hold Giant around calls into these functions. In most cases, this means adding an acquisition of Giant immediately around the function. In some cases (nfs_timer()), it means acquiring Giant higher up in the callout. With these changes, UFS no longer panics on SMP when either blocks are exhausted or inodes are exhausted under load due to races in the tty code when running without Giant. NB: Some reduction in calls to uprintf() in the svr4 code is probably desirable. NB: In the case of nfs_timer(), calling uprintf() while holding a mutex, or even in a callout at all, is a bad idea, and will generate warnings and potential upset. This needs to be fixed, but was a problem before this change. NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having non-MPSAFE tty code. MFC after: 1 week
# a16baf37	20-Feb-2005	Xin LI <delphij@FreeBSD.org>	The recomputation of file system summary at mount time can be a very slow process, especially for large file systems that is just recovered from a crash. Since the summary is already re-sync'ed every 30 second, we will not lag behind too much after a crash. With this consideration in mind, it is more reasonable to transfer the responsibility to background fsck, to reduce the delay after a crash. Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to control this behavior. When set to nonzero, we will get the "old" behavior, that the summary is computed immediately at mount time. Add five new sysctl variables to adjust ndir, nbfree, nifree, nffree and numclusters respectively. Teach fsck_ffs about these API, however, intentionally not to check the existence, since kernels without these sysctls must have recomputed the summary and hence no adjustments are necessary. This change has eliminated the usual tens of minutes of delay of mounting large dirty volumes. Reviewed by: mckusick MFC After: 1 week
# adf41577	09-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Make a some SYSCTL_NODEs and some of FFS's VFS_ methods static.
# efd6d980	08-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't use the UFS_* and VFS_* functions where a direct call is possble. The UFS_ functions are for UFS to call back into VFS. The VFS functions are external entry points into the filesystem.
# 8e37fbad	24-Jan-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't use atomic operations to deal with the active array, instead it is now quite naturally protected by the ufsmount mutex. - Use the ufs lock to protect various fields in struct fs, primarily the cg summary needs protection to avoid allocation races. Several functions have been slightly re-arranged to reduce the number of lock operations. - Adjust several functions (blkfree, freefile, etc.) to accept a ufsmount as an argument so that we may access the ufs lock. Sponsored By: Isilon Systems, Inc.
# 60727d8b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# 364ed814	09-Dec-2004	Kirk McKusick <mckusick@FreeBSD.org>	Fixes a bug that caused UFS2 filesystems bigger than 2TB to prematurely report that they were full and/or to panic the kernel with the message ``ffs_clusteralloc: allocated out of group''. Submitted by: Henry Whincup <henry@jot.to> MFC after: 1 week
# 43920011	29-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move UFS from DEVFS backing to GEOM backing. This eliminates a bunch of vnode overhead (approx 1-2 % speed improvement) and gives us more control over the access to the storage device. Access counts on the underlying device are not correctly tracked and therefore it is possible to read-only mount the same disk device multiple times: syv# mount -p /dev/md0 /var ufs rw 2 2 /dev/ad0 /mnt ufs ro 1 1 /dev/ad0 /mnt2 ufs ro 1 1 /dev/ad0 /mnt3 ufs ro 1 1 Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches things in RAM) this is not possible with read-write mounts, and the system will correctly reject this. Details: Add a geom consumer and a bufobj pointer to ufsmount. Eliminate the vnode argument from softdep_disk_prewrite(). Pick the vnode out of bp->b_vp for now. Eventually we should find it through bp->b_bufobj->b_private. In the mountcode, use g_vfs_open() once we have used VOP_ACCESS() to check permissions. When upgrading and downgrading between r/o and r/w do the right thing with GEOM access counts. Remove all the workarounds for not being able to do this with VOP_OPEN(). If we are the root mount, drop the exclusive access count until we upgrade to r/w. This allows fsck of the root filesystem and the MNT_RELOAD to work correctly. Set bo_private to the GEOM consumer on the device bufobj. Change the ffs_ops->strategy function to call g_vfs_strategy() In ufs_strategy() directly call the strategy on the disk bufobj. Same in rawread. In ffs_fsync() we will no longer see VCHR device nodes, so remove code which synced the filesystem mounted on it, in case we came there. I'm not sure this code made sense in the first place since we would have taken the specfs route on such a vnode. Redo the highly bogus readblock() function in the snapshot code to something slightly less bogus: Constructing an uio and using physio was really quite a detour. Instead just fill in a bio and ship it down.
# 60c97629	20-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Explicitly break out NETA license from Berkeley license to clearly indicate license grant, as well as to indicate that NETA is asserting only two clauses, not four clauses. Requested by: imp
# 883d3c0c	13-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove the buffercache/vnode side of BIO_DELETE processing in preparation for integration of p4::phk_bufwork. In the future, local filesystems will talk to GEOM directly and they will consequently be able to issue BIO_DELETE directly. Since the removal of the fla driver, BIO_DELETE has effectively been a no-op anyway.
# b403319b	28-Jul-2004	Alexander Kabaev <kan@FreeBSD.org>	Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper inode field based on UFS version. Use DIP ro read values and DIP_SET to modify them throughout FFS code base.
# 56f21b9d	26-Jul-2004	Colin Percival <cperciva@FreeBSD.org>	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
# 89c9c53d	16-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
# 83d8045f	19-May-2004	Ken Smith <kensmith@FreeBSD.org>	Style fixup in previous commit. Noticed by: bde (thanks!)
# f7dd67d8	14-May-2004	Ken Smith <kensmith@FreeBSD.org>	Change ffs_realloccg() to set the valid bits for the extended part of the fragment to zero the valid parts of a VM_IO buffer. RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately. Reviewed by: alc, tegge
# 012d4134	06-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and irc message from Robert Watson saying that clause 3 can be removed from those files with an NAI copyright that also have only a University of California copyrights. Approved by: core, rwatson
# c355fd5a	16-Mar-2004	Alexander Kabaev <kan@FreeBSD.org>	Avoid doing bawrite to initialize inode block while holding cylinder group block locked. If filesystem has any active snapshots, bawrite can come back trying to allocate new snapshot data block from the same cylinder group and cause panic due to recursive lock attempt. PR: 64206 Reviewed by: mckusick Tested by: pjd
# 9f206707	31-Oct-2003	Don Lewis <truckman@FreeBSD.org>	Tweak the calculation of minbfree in ffs_dirpref() so that only those cylinder groups that have at least 75% of the average free space per cylinder group for that file system are considered as candidates for the creation of a new directory. The previous formula for minbfree would set it to zero if the file system was more than 75% full, which allowed cylinder groups with no free space at all to be chosen as candidates for directory creation, which resulted in an expensive search for free blocks for each file that was subsequently created in that directory. Modify the calculation of minifree in the same way. Decrease maxcontigdirs as the file system fills to decrease the likelyhood that a cluster of directories will overflow the available space in a cylinder group. Reviewed by: mckusick Tested by: kmarx@vicor.com MFC after: 2 weeks
# f4636c59	11-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# 6280ed26	31-May-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused local variables. Found by: FlexeLint
# 2a53bfbe	20-Mar-2003	John Baldwin <jhb@FreeBSD.org>	Minor fixes to ffs_fserr(): - Assume that curthread is not NULL. It never is in -current. - Use td_ucred instead of p_ucred.
# b4b138c2	18-Mar-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Including <sys/stdint.h> is (almost?) universally only to be able to use %j in printfs, so put a newsted include in <sys/systm.h> where the printf prototype lives and save everybody else the trouble.
# 7261f5f6	03-Mar-2003	Jeff Roberson <jeff@FreeBSD.org>	- Add a new 'flags' parameter to getblk(). - Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT flag to the initial BUF_LOCK(). This will eventually be used in cases were we want to use a buffer only if it is not currently in use. - Convert all consumers of the getblk() api to use this extra parameter. Reviwed by: arch Not objected to by: mckusick
# 37e2ebfd	21-Feb-2003	Kirk McKusick <mckusick@FreeBSD.org>	This patch fixes a bug on an active filesystem on which a snapshot is being taken from panicing with either "freeing free block" or "freeing free inode". The problem arises when the snapshot code is scanning the filesystem looking for inodes with a reference count of zero (e.g., unlinked but still open) so that it can expunge them from its view. If it encounters a reclaimed vnode and has to restart its scan, then it will panic if it encounters and tries to free an inode that it has already processed. The fix is to check each candidate inode to see if it has already been processed before trying to delete it from the snapshot image. Sponsored by: DARPA & NAI Labs.
# aca3e497	14-Feb-2003	Kirk McKusick <mckusick@FreeBSD.org>	Replace use of random() with arc4random() to provide less guessable values for the initial inode generation numbers in newfs and for newly allocated inode generation numbers in the kernel. Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org> Sponsored by: DARPA & NAI Labs.
# 50bd54e3	13-Feb-2003	Kirk McKusick <mckusick@FreeBSD.org>	Correct lines incorrectly added to the copyright message. Submitted by: Frank van der Linden <fvdl@wasabisystems.com> Sponsored by: DARPA & NAI Labs.
# 48e3128b	12-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
# cd72f218	11-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
# 13438f68	31-Dec-2002	Alfred Perlstein <alfred@FreeBSD.org>	When compiling the kernel do not implicitly include filedesc.h from proc.h, this was causing filedesc work to be very painful. In order to make this work split out sigio definitions to thier own header (sigio.h) which is included from proc.h for the time being.
# c021e447	17-Dec-2002	Kirk McKusick <mckusick@FreeBSD.org>	Cosmetic cleanup of unsigned buglets. Submitted by: Bruce Evans <bde@zeta.org.au> Sponsored by: DARPA & NAI Labs.
# 8d6754f2	05-Dec-2002	Kirk McKusick <mckusick@FreeBSD.org>	More tightly verify the preference returned for the new inode. Submitted by: Kris Kennaway <kris@obsecurity.org> Sponsored by: DARPA & NAI Labs. Approved by: re
# 47a56126	18-Sep-2002	David E. O'Brien <obrien@FreeBSD.org>	intmax_t is printed with %jd, not %lld.
# e6e370a7	04-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# 9fbc6a33	29-Jul-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Fix braino in last commit.
# 17b1994b	30-Jul-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Move ffs_isfreeblock() to ffs_alloc.c and make it static. Sponsored by: DARPA & NAI Labs.
# 7aca6291	19-Jul-2002	Kirk McKusick <mckusick@FreeBSD.org>	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags \|= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, lets call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended atribute storage. Sponsored by: DARPA & NAI Labs.
# faab4e27	16-Jul-2002	Kirk McKusick <mckusick@FreeBSD.org>	Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime. Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.
# 2daf9dc8	07-Jul-2002	Bruce Evans <bde@FreeBSD.org>	Fixed some printf format errors (4 new ones reported by gcc and 5 nearby old ones not reported by gcc). This helps unbreak LINT.
# cfbf0a46	23-Jun-2002	Maxime Henrion <mux@FreeBSD.org>	Warning fixes for 64 bits platforms. This eliminates all the warnings I have had in the FFS code on sparc64. Reviewed by: mckusick
# 5006e776	22-Jun-2002	Kirk McKusick <mckusick@FreeBSD.org>	This patch fixes a problem whereby filesystems that ran out of inodes in a cylinder group would fail to check for free inodes in other cylinder groups. This bug was introduced in the UFS2 code merge two days ago. An inode is allocated by calling ffs_valloc which calls ffs_hashalloc to do the filesystem scan. Ffs_hashalloc walks around the cylinder groups calling its passed allocator (ffs_nodealloccg in this case) until the allocator returns a non-zero result. The bug is that ffs_hashalloc expects the passed allocator function to return a 64-bit ufs2_daddr_t. When allocating inodes, it calls ffs_nodealloccg which was returning a 32-bit ino_t. The ffs_hashalloc code checked a 64-bit return value and usually found random non-zero bits in the high 32-bits so decided that the allocation had succeeded (in this case in the only cylinder group that it checked). When the result was passed back to ffs_valloc it looked at only the bottom 32-bits, saw zero and declared the system out of inodes. But ffs_hashalloc had really only checked one cylinder group. The fix is to change ffs_nodealloccg to return 64-bit results. Sponsored by: DARPA & NAI Labs. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Reviewed by: Maxime Henrion <mux@freebsd.org>
# 1c85e6a3	21-Jun-2002	Kirk McKusick <mckusick@FreeBSD.org>	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
# d394511d	16-May-2002	Tom Rhodes <trhodes@FreeBSD.org>	More s/file system/filesystem/g
# 05f4ff5d	13-May-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Remove register keyword. Sponsored by: DARPA & NAI Labs. Submitted by: mckusick
# 44731cab	01-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 6f1e8551	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# a0595d02	16-Mar-2002	Kirk McKusick <mckusick@FreeBSD.org>	Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems. only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.
# b06051cf	07-Feb-2002	Kirk McKusick <mckusick@FreeBSD.org>	Occationally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned. Submitted by: Robert Watson <rwatson@FreeBSD.org>
# c9f96392	01-Feb-2002	Kirk McKusick <mckusick@FreeBSD.org>	When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.
# 03a2057a	21-Jan-2002	Kirk McKusick <mckusick@FreeBSD.org>	This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.
# 426da3bc	13-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.
# f305c5d1	18-Dec-2001	Kirk McKusick <mckusick@FreeBSD.org>	Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64. Requested by: Jake Burkholder <jake@locore.ca>
# cc5a9233	13-Dec-2001	Kirk McKusick <mckusick@FreeBSD.org>	Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be maked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.
# ab66aa14	02-Oct-2001	Robert Watson <rwatson@FreeBSD.org>	o Replace two direct uid!=0 comparisons with suser_xxx() calls. Obtained from: TrustedBSD Project
# 78236790	15-Jun-2001	Peter Wemm <peter@FreeBSD.org>	Fix warning: 1973: warning: int format, long int arg (arg 5)
# 23371b2f	03-May-2001	Kirk McKusick <mckusick@FreeBSD.org>	Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.
# 60fb0ce3	28-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Revert consequences of changes to mount.h, part 2. Requested by: bde
# d98dc34f	23-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Correct #includes to work with fixed sys/mount.h.
# f0f3f19f	16-Apr-2001	Kirk McKusick <mckusick@FreeBSD.org>	Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.
# a61ab64a	10-Apr-2001	Kirk McKusick <mckusick@FreeBSD.org>	Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved. ------ One day I noticed that some file operations run much faster on small file systems then on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the perfomance speedup may be. I have done two file/directory intensive tests on a two OpenBSD systems with old and new dirpref algorithm. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are: 1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35 2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1. Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50 You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html Test Results tar -xzf ports.tar.gz rm -rf ports mode old dirpref new dirpref speedup old dirprefnew dirpref speedup First system normal 667 472 1.41 477 331 1.44 async 285 144 1.98 130 14 9.29 sync 768 616 1.25 477 334 1.43 softdep 413 252 1.64 241 38 6.34 Second system normal 329 81 4.06 263.5 93.5 2.81 async 302 25.7 11.75 112 2.26 49.56 sync 281 57.0 4.93 263 90.5 2.9 softdep 341 40.6 8.4 284 4.76 59.66 "old dirpref" and "new dirpref" columns give a test time in seconds. speedup - speed increasement in times, ie. old dirpref / new dirpref. ------ Algorithm description The old dirpref algorithm is described in comments: /* * Find a cylinder to place a directory. * * The policy implemented by this algorithm is to select from * among those cylinder groups with above the average number of * free inodes, the one with the smallest number of directories. / A new directory is allocated in a different cylinder groups than its parent directory resulting in a directory tree that is spreaded across all the cylinder groups. This spreading out results in a non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then perfomance degradation becomes very apparent. What I mean by a big file system ? 1. A big filesystem is a filesystem which occupy 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other. 2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache. The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations. My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is: / * Find a cylinder group to place a directory. * * The policy implemented by this algorithm is to allocate a * directory inode in the same cylinder group as its parent * directory, but also to reserve space for its files inodes * and data. Restrict the number of directories which may be * allocated one after another in the same cylinder group * without intervening allocation of files. * * If we allocate a first level directory then force allocation * in another cylinder group. / My early versions of dirpref give me a good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array. The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are: int32_t fs_avgfilesize; / expected average file size / int32_t fs_avgfpdir; / expected # of files per directory */ These parameters have reasonable defaults but may be tweeked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache. I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem perfomance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/coping of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speedup a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>
# 5d0b660f	24-Mar-2001	Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>	Fix typo ); -> ,
# fca26df0	23-Mar-2001	Kirk McKusick <mckusick@FreeBSD.org>	Check that background fsck operation is being done on a ufs filesystem. Obtained from: Robert Watson <rwatson@FreeBSD.org>
# 812b1d41	20-Mar-2001	Kirk McKusick <mckusick@FreeBSD.org>	Add kernel support for running fsck on active filesystems.
# 7e72e991	20-Mar-2001	Kirk McKusick <mckusick@FreeBSD.org>	Report the correct inode number when panicing with freeing free inode. Report the correct block number when panicing with freeing free block.
# d7d97eb0	18-Feb-2001	Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>	Preceed/preceeding are not english words. Use precede and preceding.
# 6ee6b42e	28-Jul-2000	Peter Wemm <peter@FreeBSD.org>	Minor change: fix warning - move a 'struct vnode *vp' declaration inside a #ifdef DIAGNOSTIC to match its corresponding usage.
# f2a2857b	11-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 9626b608	05-May-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# a64ed089	14-Apr-2000	Robert Watson <rwatson@FreeBSD.org>	Introduce extended attribute support for FFS, allowing arbitrary (name, value) pairs to be associated with inodes. This support is used for ACLs, MAC labels, and Capabilities in the TrustedBSD security extensions, which are currently under development. In this implementation, attributes are backed to data vnodes in the style of the quota support in FFS. Support for FFS extended attributes may be enabled using the FFS_EXTATTR kernel option (disabled by default). Userland utilities and man pages will be committed in the next batch. VFS interfaces and man pages have been in the repo since 4.0-RELEASE and are unchanged. o ufs/ufs/extattr.h: UFS-specific extattr defines o ufs/ufs/ufs_extattr.c: bulk of support routines o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes o contrib/softupdates/ffs_softdep.c: extattr.h includes o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h (This should not be the case, and will be fixed in a future commit) Currently attributes are not supported in MFS. This will be fixed. Reviewed by: adrian, bp, freebsd-fs, other unthanked souls Obtained from: TrustedBSD Project
# 9f043878	15-Mar-2000	Kirk McKusick <mckusick@FreeBSD.org>	Use 64-bit math to decide if optimization needs to be changed. Necessary for coherent results on filesystems bigger than 0.5Tb. Submitted by: Paul Saab <ps@yahoo-inc.com>
# cf60e8e4	09-Jan-2000	Kirk McKusick <mckusick@FreeBSD.org>	Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
# 369dc8ce	21-Dec-1999	Eivind Eklund <eivind@FreeBSD.org>	Change incorrect NULLs to 0s
# 9f54c052	01-Dec-1999	Kirk McKusick <mckusick@FreeBSD.org>	Preferentially allocate the first indirect block in the same cylinder group as the inode. This makes a 15% difference in read speed for files in the 96K to 500K size range.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# 740e3a15	24-Aug-1999	Sheldon Hearn <sheldonh@FreeBSD.org>	Fix bug introduced in rev 1.28, which causes kernel build to break for the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is declared conditionally on DIAGNOSTIC, not DEBUG. PR: 13314 Reviewed by: bde
# d9183205	23-Aug-1999	Bruce Evans <bde@FreeBSD.org>	Use devtoname() to print dev_t's instead of casting them to long or u_long for misprinting in %lx format.
# 51b52266	12-May-1999	Peter Wemm <peter@FreeBSD.org>	Try and fix a dev_t/major/minor etc nit.
# dfd5dee1	06-May-1999	Peter Wemm <peter@FreeBSD.org>	Add sufficient braces to keep egcs happy about potentially ambiguous if/else nesting.
# de5d1ba5	07-Jan-1999	Bruce Evans <bde@FreeBSD.org>	Don't pass unused unused timestamp args to UFS_UPDATE() or waste time initializing them. This almost finishes centralizing (in-core) timestamp updates in ufs_itimes().
# d64dbc87	06-Jan-1999	Bruce Evans <bde@FreeBSD.org>	Ifdefed the conditionally used variable `prtrealloc'. Declare it as volatile so that there is no chance that the code that it controls is optimised away.
# 1c680b45	12-Nov-1998	David Greenman <dg@FreeBSD.org>	Restored the "reallocblks" code to its former glory. What this does is basically do a on-the-fly defragmentation of the FFS filesystem, changing file block allocations to make them contiguous. Thanks to Kirk McKusick for providing hints on what needed to be done to get this working.
# ff261f16	07-Sep-1998	Bruce Evans <bde@FreeBSD.org>	Put the zombie ffs sysctl node in "notyet" state together with its few remaining children. Prepare it for MOUNT_UFS going away.
# 0375c9f2	05-Sep-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use. Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags. This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions. Reviewed by: Kirk Mckusick, bde Sponsored by: M-Systems (www.m-sys.com)
# 0492d857	17-Aug-1998	Bruce Evans <bde@FreeBSD.org>	Removed unused includes.
# ac1e407b	11-Jul-1998	Bruce Evans <bde@FreeBSD.org>	Fixed printf format errors.
# 227ee8a1	30-Mar-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't a atomic variable, so splfoo() protection were needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now uses the new variable time_second instead. gettime() changed to getmicrotime(0. Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeard time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passwd &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# b1897c19	08-Mar-1998	Julian Elischer <julian@FreeBSD.org>	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# 23906a78	02-Dec-1997	Bruce Evans <bde@FreeBSD.org>	Fix a small style bug in the generation number change (rev.1.33) before copying the change to other fs's.
# cb451ebd	22-Nov-1997	Bruce Evans <bde@FreeBSD.org>	Staticized.
# 2ea354c3	22-Nov-1997	Bruce Evans <bde@FreeBSD.org>	Unremoved prtrealloc and the declaration of ffs_clusteralloc(). These are used in the `#ifdef notyet' case :-). This case is used except in the `#if !defined (not_yes)' case :-\|. This has something to do with the `#ifdef notyet_block_reallocation_enabled' case in vfs_cluster.c :-(.
# 4a11ca4e	07-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 987f5696	16-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Another VFS cleanup "kilo commit" 1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS} intereface function, and now lives in the ufsmount structure. 2. Remove VOP_SEEK, it was unused. 3. Add mode default vops: VOP_ADVLOCK vop_einval VOP_CLOSE vop_null VOP_FSYNC vop_null VOP_IOCTL vop_enotty VOP_MMAP vop_einval VOP_OPEN vop_null VOP_PATHCONF vop_einval VOP_READLINK vop_einval VOP_REALLOCBLKS vop_eopnotsupp And remove identical functionality from filesystems 4. Add vop_stdpathconf, which returns the canonical stuff. Use it in the filesystems. (XXX: It's probably wrong that specfs and fifofs sets this vop, shouldn't it come from the "host" filesystem, for instance ufs or cd9660 ?) 5. Try to make system wide VOP functions have vop_* names. 6. Initialize the um_* vectors in LFS. (Recompile your LKMS!!!)
# cec0f20c	16-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	VFS mega cleanup commit (x/N) 1. Add new file "sys/kern/vfs_default.c" where default actions for VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE, POLL, REVOKE and STRATEGY. Various stuff spread over the entire tree belongs here. 2. Change VOP_BLKATOFF to a normal function in cd9660. 3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These are private interface functions between UFS and the underlying storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now live in struct ufsmount instead. 4. Remove a kludge of VOP_ functions in all filesystems, that did nothing but obscure the simplicity and break the expandability. If a filesystem doesn't implement VOP_FOO, it shouldn't have an entry for it in its vnops table. The system will try to DTRT if it is not implemented. There are still some cruft left, but the bulk of it is done. 5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
# 40715905	14-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	I think my previous change may have opened a race conditio. This patch does the same thing, with no change in semantics.
# 34a6a330	14-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	ufs_ihashrem() should not be called from the UFS layer, but from the lower layer (LFS/FFS/?) like the rest of the ihash functions. Otherwise it is impossible to make a lower layer that doesn't use the ihash facility.
# ec1d10e4	19-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	[Regarding the previous patch] This is completely wrong. 1. ffs_alloc() actually allowed writing one block less one frag (normally 7 frags or 7/8 blocks) beyond the limit. 2. freebufspace() gives the free space in frags, but `size' is in bytes, so the change results in approximately `size' fragments too many being reserved. 3. ffs_realloccg() has the same bug but wasn't changed. PR: 3398 Submitted by: bde Eyeballed by: phk
# 67720457	18-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Ffs_alloc allow users to write one block beyond the limit. PR: 3398 Reviewed by: phk Submitted by: Wolfram Schneider <wosch@apfel.de>
# e4ba6a82	02-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 3758b280	04-Aug-1997	Poul-Henning Kamp <phk@FreeBSD.org>	We got a couple of "map mismatch" panics from the following code. According to the crash dump, bpref is set to 445 and cgp->cg_nclusterblks is 444. Hence in the for loop, the test fails immediately but the following failure check (got == cgp->cg_nclusterblks) doesn't trigger because got > cgp->cg_nclusterblks. This wreaks havoc in the code after that. Fix: Move one source bit to the left :-) Noticed by: Mike Hibler <mike@fast.cs.utah.edu> Submitted by: Kirk McKusick <mckusick@McKusick.COM>
# 8f89943e	23-Mar-1997	Guido van Rooij <guido@FreeBSD.org>	Add generation number randomization. Newly created filesystems wil now automatically have random generation numbers. The kenel way of handling those also changed. Further it is advised to run fsirand on all your nfs exported filesystems. the code is mostly copied from OpenBSD, with the randomization chanegd to use /dev/urandom Reviewed by: Garrett Obtained from: OpenBSD
# 3c816944	21-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Fixed some invalid (non-atomic) accesses to `time', mostly ones of the form `tv = time'. Use a new function gettime(). The current version just forces atomicicity without fixing precision or efficiency bugs. Simplified some related valid accesses by using the central function.
# 5ace3b26	08-Mar-1997	Mike Pritchard <mpp@FreeBSD.org>	Update a number of panic messages to reflect the actual name of the routine that caused the panic.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 812ac98e	10-Feb-1997	Mike Pritchard <mpp@FreeBSD.org>	Correct the new Lite2 #ifdef DIAGNOSTIC ffs_checkblk routine to not return without setting a return value when it can't read a block error or detects a bad cylinder group, since the caller is expecting a return value. It will now panic at this point, since the thing to do in this case would be to return a "bad block" status to the caller, and the caller will panic anyways when that happens. Also updated to panic strings in this routine to read "ffs_checkblk: ..." instead of "checkblk: ...".
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 7b8830a5	17-Sep-1996	Peter Wemm <peter@FreeBSD.org>	Argh, I have had one "uid 0 on /: file system full" too many. The problem is that it doesn't say _what_ did it! (the core dumped console message is very useful for listing the process name and pid). This adds similar information.
# 6ab46d52	11-Jul-1996	Bruce Evans <bde@FreeBSD.org>	Don't use NULL in non-pointer contexts.
# 6ddbf1e2	07-May-1996	Gary Palmer <gpalmer@FreeBSD.org>	Clean up various compiler warnings. Most (if not all) were benign Reviewed by: bde
# e1eec28a	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change.
# 01733a9b	05-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Convert QUOTA to new-style option.
# b8dce649	17-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Staticize.
# dbe525ff	14-Dec-1995	Peter Wemm <peter@FreeBSD.org>	Silence a harmless warning...
# 57a4e3fa	03-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Completed function declarations and/or added prototypes and/or #includes to get the prototypes.
# aa396019	19-Nov-1995	John Dyson <dyson@FreeBSD.org>	General fixes to the vfs clustring code: 1) Make cluster buffer list be a non-malloced chain. This eliminates yet another 'evil' M_WAITOK and generally cleans up the code. 2) Fix write clustering for ext2fs. It was just broken. Also, ffs clustering had an efficiency problem that more bawrites were happening than should have been. 3) Make changes to buf.h to support the above, plus remove b_pfcent at the request of David Greenman. Note that the reallocblocks code is disabled pending rewrite for the cluster buffer list changes.
# af8364b0	14-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Get rid of the last debug sysctl variables of the old style.
# 9eda220c	08-Sep-1995	David Greenman <dg@FreeBSD.org>	Slight optimization for the standard case of rotdelay=0.
# 4238af7c	25-Aug-1995	Bruce Evans <bde@FreeBSD.org>	Don't call VOP_UPDATE() with volatile timestamps.
# 39e68756	07-Aug-1995	David Greenman <dg@FreeBSD.org>	Use bdwrite() rather than brelse(). The cylinder group bitmap modification is not preserved otherwise. Note that this is a no-op in FreeBSD, however, as we have doreallocblks disabled. Submitted by: Kirk McKusick
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# b2b795f0	11-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Fix -Wformat warnings from LINT kernel.
# f57459b6	26-Mar-1995	David Greenman <dg@FreeBSD.org>	Removed third arg (vmio) to allocbuf() that was added with the original merged cache changes, and figure it out based on the B_VMIO buffer flag. Fixes a problem where delayed write VMIO buffers would sometimes get recopied into kernel-alloced memory. Submitted by: John Dyson
# edf8a815	19-Mar-1995	David Greenman <dg@FreeBSD.org>	Removed redundant newlines that were in some panic strings.
# e23c0ff5	10-Mar-1995	David Greenman <dg@FreeBSD.org>	The threshold for switching from time-space and space-time is too small when minfree is 5%...so make it stay at space in this case. Submitted by: Kirk McKusick
# 22470903	03-Mar-1995	David Greenman <dg@FreeBSD.org>	Fixes from John Dyson to work around vnode lock hang. Basically, remove the VOP_BMAP calls, and add one to bdwrite. Submitted by: John Dyson
# 7368f324	27-Feb-1995	Stefan Eßer <se@FreeBSD.org>	Don't try to make use of useless rotational position optimisation, if all free blocks are in the same bucket (i.e. NRPOS == 1). Else a free block is choosen, possibly from a different cylinder, even if the block succeeding bpref was free ... Submitted by: se
# d2fc5315	13-Feb-1995	Poul-Henning Kamp <phk@FreeBSD.org>	YF fix.
# 0d94caff	09-Jan-1995	David Greenman <dg@FreeBSD.org>	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# c1d9efcb	09-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics. make gcc less noisy. Still some way to go here.
# 1819efa6	19-Sep-1994	Bruce Evans <bde@FreeBSD.org>	Use `1' for a boolean value instead of something irrelevant (MNT_WAIT) that happens to be nonzero.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources