Cross Reference: /freebsd-10.1-release/sys/ufs/ufs/ufs

History log of /freebsd-10.1-release/sys/ufs/ufs/ufs_vnops.c
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
# 272461	02-Oct-2014	gjb	Copy stable/10@r272459 to releng/10.1 as part of the 10.1-RELEASE process. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-10.1-release
# 270695	26-Aug-2014	kib	MFC r270204: Do not busy the UFS mount point inside VOP_RENAME().
# 269283	30-Jul-2014	kib	MFC r268764: Check for the cross-device cross-link attempt in the VFS, instead of VOP_LINK() implemenations.
# 267816	24-Jun-2014	kib	MFC r267564: In msdosfs_setattr(), add a check for result of the utimes(2) permissions test. Refactor the permission checks for utimes(2).
# 256281	10-Oct-2013	gjb	Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 254627	21-Aug-2013	ken	Expand the use of stat(2) flags to allow storing some Windows/DOS and CIFS file attributes as BSD stat(2) flags. This work is intended to be compatible with ZFS, the Solaris CIFS server's interaction with ZFS, somewhat compatible with MacOS X, and of course compatible with Windows. The Windows attributes that are implemented were chosen based on the attributes that ZFS already supports. The summary of the flags is as follows: UF_SYSTEM: Command line name: "system" or "usystem" ZFS name: XAT_SYSTEM, ZFS_SYSTEM Windows: FILE_ATTRIBUTE_SYSTEM This flag means that the file is used by the operating system. FreeBSD does not enforce any special handling when this flag is set. UF_SPARSE: Command line name: "sparse" or "usparse" ZFS name: XAT_SPARSE, ZFS_SPARSE Windows: FILE_ATTRIBUTE_SPARSE_FILE This flag means that the file is sparse. Although ZFS may modify this in some situations, there is not generally any special handling for this flag. UF_OFFLINE: Command line name: "offline" or "uoffline" ZFS name: XAT_OFFLINE, ZFS_OFFLINE Windows: FILE_ATTRIBUTE_OFFLINE This flag means that the file has been moved to offline storage. FreeBSD does not have any special handling for this flag. UF_REPARSE: Command line name: "reparse" or "ureparse" ZFS name: XAT_REPARSE, ZFS_REPARSE Windows: FILE_ATTRIBUTE_REPARSE_POINT This flag means that the file is a Windows reparse point. ZFS has special handling code for reparse points, but we don't currently have the other supporting infrastructure for them. UF_HIDDEN: Command line name: "hidden" or "uhidden" ZFS name: XAT_HIDDEN, ZFS_HIDDEN Windows: FILE_ATTRIBUTE_HIDDEN This flag means that the file may be excluded from a directory listing if the application honors it. FreeBSD has no special handling for this flag. The name and bit definition for UF_HIDDEN are identical to the definition in MacOS X. UF_READONLY: Command line name: "urdonly", "rdonly", "readonly" ZFS name: XAT_READONLY, ZFS_READONLY Windows: FILE_ATTRIBUTE_READONLY This flag means that the file may not written or appended, but its attributes may be changed. ZFS currently enforces this flag, but Illumos developers have discussed disabling enforcement. The behavior of this flag is different than MacOS X. MacOS X uses UF_IMMUTABLE to represent the DOS readonly permission, but that flag has a stronger meaning than the semantics of DOS readonly permissions. UF_ARCHIVE: Command line name: "uarch", "uarchive" ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE Windows name: FILE_ATTRIBUTE_ARCHIVE The UF_ARCHIVED flag means that the file has changed and needs to be archived. The meaning is same as the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute. msdosfs and ZFS have special handling for this flag. i.e. they will set it when the file changes. sys/param.h: Bump __FreeBSD_version to 1000047 for the addition of new stat(2) flags. chflags.1: Document the new command line flag names (e.g. "system", "hidden") available to the user. ls.1: Reference chflags(1) for a list of file flags and their meanings. strtofflags.c: Implement the mapping between the new command line flag names and new stat(2) flags. chflags.2: Document all of the new stat(2) flags, and explain the intended behavior in a little more detail. Explain how they map to Windows file attributes. Different filesystems behave differently with respect to flags, so warn the application developer to take care when using them. zfs_vnops.c: Add support for getting and setting the UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN, UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags. All of these flags are implemented using attributes that ZFS already supports, so the on-disk format has not changed. ZFS currently doesn't allow setting the UF_REPARSE flag, and we don't really have the other infrastructure to support reparse points. msdosfs_denode.c, msdosfs_vnops.c: Add support for getting and setting UF_HIDDEN, UF_SYSTEM and UF_READONLY in MSDOSFS. It supported SF_ARCHIVED, but this has been changed to be UF_ARCHIVE, which has the same semantics as the DOS archive attribute instead of inverse semantics like SF_ARCHIVED. After discussion with Bruce Evans, change several things in the msdosfs behavior: Use UF_READONLY to indicate whether a file is writeable instead of file permissions, but don't actually enforce it. Refuse to change attributes on the root directory, because it is special in FAT filesystems, but allow most other attribute changes on directories. Don't set the archive attribute on a directory when its modification time is updated. Windows and DOS don't set the archive attribute in that scenario, so we are now bug-for-bug compatible. smbfs_node.c, smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM, UF_READONLY and UF_ARCHIVE in SMBFS. This is similar to changes that Apple has made in their version of SMBFS (as of smb-583.8, posted on opensource.apple.com), but not quite the same. We map SMB_FA_READONLY to UF_READONLY, because UF_READONLY is intended to match the semantics of the DOS readonly flag. The MacOS X code maps both UF_IMMUTABLE and SF_IMMUTABLE to SMB_FA_READONLY, but the immutable flags have stronger meaning than the DOS readonly bit. stat.h: Add definitions for UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY and UF_HIDDEN. The definition of UF_HIDDEN is the same as the MacOS X definition. Add commented-out definitions of UF_COMPRESSED and UF_TRACKED. They are defined in MacOS X (as of 10.8.2), but we do not implement them (yet). ufs_vnops.c: Add support for getting and setting UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY, UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS. Alphabetize the flags that are supported. These new flags are only stored, UFS does not take any action if the flag is set. Sponsored by: Spectra Logic Reviewed by: bde (earlier version)
# 253998	06-Aug-2013	mckusick	This bug fix is in a code path in rename taken when there is a collision between a rename and an open system call for the same target file. Here, rename releases its vnode references, waits for the open to finish, and then restarts by reacquiring its needed vnode locks. In this case, rename was unlocking but failing to release its reference to one of its held vnodes. The effect was that even after all the actual references to the vnode had gone, the vnode still showed active references. For files that had been removed, their space was not reclaimed until the filesystem was forcibly unmounted. This bug manifested itself in the Postgres server which would leak/lose hundreds of files per day amounting to many gigabytes of disk space. This bug required shutting down Postgres, forcibly unmounting its filesystem, remounting its filesystem and restarting Postgres every few days to recover the lost space. Reported by: Dan Thomas and Palle Girgensohn Bug-fix by: kib Tested by: Dan Thomas and Palle Girgensohn MFC after: 2 weeks
# 252438	01-Jul-2013	gleb	Don't assume that UFS on-disk format of a directory is the same as defined by <sys/dirent.h> Always start parsing at DIRBLKSIZ aligned offset, skip first entries if uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too small for single entry. Preallocate buffer for cookies. Cookies will be replaced with d_off field in struct dirent at later point. Skip entries with zero inode number. Stop mangling dirent in ufs_extattr_iterate_directory(). Reviewed by: kib Sponsored by: Google Summer Of Code 2011
# 248422	17-Mar-2013	kib	Remove negative name cache entry pointing to the target name, which could be instantiated while tdvp was unlocked. Reported by: Rick Miller <vmiller at hostileadmin com> Tested by: pho MFC after: 1 week
# 241011	27-Sep-2012	mdf	Fix up kernel sources to be ready for a 64-bit ino_t. Original code by: Gleb Kurtsou
# 236044	26-May-2012	kib	Implement SEEK_HOLE/SEEK_DATA for UFS. MFC after: 2 weeks
# 234605	23-Apr-2012	trasz	Remove unused thread argument from vtruncbuf(). Reviewed by: kib
# 234421	18-Apr-2012	jh	The part about exec atime no longer applies in the comment. Pointed out by: bde
# 234103	10-Apr-2012	jh	- Return EPERM from ufs_setattr() when an user without PRIV_VFS_SYSFLAGS privilege attempts to toggle SF_SETTABLE flags. - Use the '^' operator in the SF_SNAPSHOT anti-toggling check. Flags are now stored to ip->i_flags in one place after all checks. Submitted by: bde
# 233875	04-Apr-2012	jh	Add a check for unsupported file flags to ufs_setattr(). Discussed with: bde MFC after: 2 weeks
# 233817	02-Apr-2012	mckusick	A file cannot be deallocated until its last name has been removed and it is no longer referenced by a user process. The inode for a file whose name has been removed, but is still referenced at the time of a crash will still be allocated in the filesystem, but will have no references (e.g., they will have no names referencing them from any directory). With traditional soft updates these unreferenced inodes will be found and reclaimed when the background fsck is run. When using journaled soft updates, the kernel must keep track of these inodes so that it can find and reclaim them during the cleanup process. Their existence cannot be stored in the journal as the journal only handles short-term events, and they may persist for days. So, they are tracked by keeping them in a linked list whose head pointer is stored in the superblock. The journal tracks them only until their linked list pointers have been commited to disk. Part of the cleanup process involves traversing the list of unreferenced inodes and reclaiming them. This bug was triggered when confusion arose in the commit steps of keeping the unreferenced-inode linked list coherent on disk. Notably, a race between the link() system call adding a link-count to a file and the unlink() system call removing a link-count to the file. Here if the unlink() ran after link() had looked up the file but before link() had incremented the link-count of the file, the file's link-count would drop to zero before the link() incremented it back up to one. If the file was referenced by a user process, the first transition through zero made it appear that it should be added to the unreferenced-inode list when in fact it should not have been added. If the new name created by link() was deleted within a few seconds (with the file still referenced by a user process) it would legitimately be a candidate for addition to the unreferenced-inode list. The result was that there were two attempts to add the same inode to the unreferenced-inode list which scrambled the unreferenced-inode list's pointers leading to a panic. The fix is to detect and avoid the false attempt at adding it to the unreferenced-inode list by having the link() system call check to see if the link count is zero before it increments it. If it is, the link() fails with ENOENT (showing that it has failed the link()/unlink() race). While tracking down this bug, we have added additional assertions to detect the problem sooner and also simplified some of the code. Reported by: Kirk Russell Fix submitted by: Jeff Roberson Tested by: Peter Holm PR: kern/159971 MFC (to 9 only): 2 weeks
# 233787	02-Apr-2012	jh	- Use more natural ip->i_flags instead of vap->va_flags in the final flags check. - Add a comment for the immutable/append check done after handling of the flags. - Style improvements. No functional change intended. Submitted by: bde MFC after: 2 weeks
# 232821	11-Mar-2012	kib	Remove fifo.h. The only used function declaration from the header is migrated to sys/vnode.h. Submitted by: gianni
# 232401	02-Mar-2012	jhb	Similar to the fixes in 226967 and 226987, purge any name cache entries associated with the previous vnode (if any) associated with the target of a rename(). Otherwise, a lookup of the target pathname concurrent with a rename() could re-add a name cache entry after the namei(RENAME) lookup in kern_renameat() had purged the target pathname. MFC after: 2 weeks
# 231122	07-Feb-2012	kib	Sprinkle missed calls to asynchronous UFS_UPDATE() in attempt to guarantee that all UFS inode metadata changes results in the dirtiness of the inodeblock. Due to missed inodeblock updates, syncer was required to fsync each mount point' vnode to guarantee periodic metadata flush. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks
# 226971	31-Oct-2011	pho	Fix the wrong commit log message for r226967: "Added missing cache purge of from argument" and fix the comment.
# 226967	31-Oct-2011	pho	The kern_renameat() looks up the fvp using the DELETE flag, which causes the removal of the name cache entry for fvp. Reported by: Anton Yuzhaninov <citrin citrin ru> In collaboration with: kib MFC after: 1 week
# 223020	12-Jun-2011	mckusick	Update to soft updates journaling to properly track freed blocks that get claimed by snapshots. Submitted by: Jeff Roberson Tested by: Peter Holm
# 218838	19-Feb-2011	kib	v_mountedhere is a member of the union. Check that the vnodes have proper type before using the member. Reported and tested by: Michael Butler <imb protected-networks net>
# 218513	10-Feb-2011	netchild	Wrap long line. Noticed by: bz
# 218485	09-Feb-2011	netchild	Add some FEATURE macros for some UFS features. SU+J is not included as a FEATURE macro: - it was not in the tree during the GSoC - I do not see an option to en-/disable it in NOTES Two minor changes where made during the review compared to what was developed during GSoC 2010. No FreeBSD version bump, the userland application to query the features will be committed last and can serve as an indication of the availablility if needed. Sponsored by: Google Summer of Code 2010 Submitted by: kibab Reviewed by: kib X-MFC after: to be determined in last commit with code from this project
# 216818	30-Dec-2010	kib	Handle missing jremrefs when a directory is renamed overtop of another, deleting it. If the directory is removed, UFS always need to remove the .. ref, even if the ultimate ref on the parent would not change. The new directory must have a new journal entry for that ref. Otherwise journal processing would not properly account for the parent's reference since it will belong to a removed directory entry. Change ufs_rename()'s dotdot rename section to always setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency allocated via setup_dotdot_link(). Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove(). Remove the isdirrem > 1 checks from newdirrem(). Reported by: many Submitted by: jeff Tested by: pho
# 215052	09-Nov-2010	jhb	Remove unused includes of <sys/mutex.h> and <machine/mutex.h>.
# 209717	06-Jul-2010	jeff	- Handle the truncation of an inode with an effective link count of 0 in the context of the process that reduced the effective count. Previously all truncation as a result of unlink happened in the softdep flush thread. This had the effect of being impossible to rate limit properly with the journal code. Now the process issuing unlinks is suspended when the journal files. This has a side-effect of improving rm performance by allowing more concurrent work. - Handle two cases in inactive, one for effnlink == 0 and another when nlink finally reaches 0. - Eliminate the SPACECOUNTED related code since the truncation is no longer delayed. Discussed with: mckusick
# 207141	24-Apr-2010	jeff	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
# 202934	24-Jan-2010	trasz	Move out code that does POSIX.1e ACL inheritance into separate routines. Reviewed by: rwatson
# 200796	21-Dec-2009	trasz	Implement NFSv4 ACL support for UFS. Reviewed by: rwatson
# 197269	17-Sep-2009	brooks	Allocate space for the group array in a static credential used in the quota code. One case was correctly handled in r194498, but this one was missed. PR: kern/138657 Tested by: PR submitter MFC after: 3 days
# 195296	02-Jul-2009	trasz	Fix fpathconf(3) on fifos, in effect making ls(1) properly display '+' on them. Taken from kern/125613, with cosmetic changes. PR: kern/125613 Submitted by: Jaakko Heinonen <jh at saunalahti dot fi> Approved by: re (kib)
# 194498	19-Jun-2009	brooks	Rework the credential code to support larger values of NGROUPS and NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024 and 1023 respectively. (Previously they were equal, but under a close reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it is the number of supplemental groups, not total number of groups.) The bulk of the change consists of converting the struct ucred member cr_groups from a static array to a pointer. Do the equivalent in kinfo_proc. Introduce new interfaces crcopysafe() and crsetgroups() for duplicating a process credential before modifying it and for setting group lists respectively. Both interfaces take care for the details of allocating groups array. crsetgroups() takes care of truncating the group list to the current maximum (NGROUPS) if necessary. In the future, crsetgroups() may be responsible for insuring invariants such as sorting the supplemental groups to allow groupmember() to be implemented as a binary search. Because we can not change struct xucred without breaking application ABIs, we leave it alone and introduce a new XU_NGROUPS value which is always 16 and is to be used or NGRPS as appropriate for things such as NFS which need to use no more than 16 groups. When feasible, truncate the group list rather than generating an error. Minor changes: - Reduce the number of hand rolled versions of groupmember(). - Do not assign to both cr_gid and cr_groups[0]. - Modify ipfw to cache ucreds instead of part of their contents since they are immutable once referenced by more than one entity. Submitted by: Isilon Systems (initial implementation) X-MFC after: never PR: bin/113398 kern/133867
# 194296	16-Jun-2009	kib	Do not use casts (int )0 and (struct thread )0 for the arguments of vn_rdwr, use NULL. Reviewed by: jhb MFC after: 1 week
# 193511	05-Jun-2009	rwatson	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 192895	27-May-2009	jamie	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
# 191564	27-Apr-2009	rmacklem	Change the semantics of i_modrev/va_filerev to what is required for the nfsv4 Change attribute. There are 2 changes: 1 - The value now changes on metadata changes as well as data modifications (incremented for IN_CHANGE instead of IN_UPDATE). 2 - It is now saved in spare space in the on-disk i-node so that it survives a crash. Since va_filerev is not passed out into user space, the only current use of va_filerev is in the nfs server, which uses it as the directory cookie verifier. Since this verifier is only passed back to the server by a client verbatim and then the server doesn't check it, changing the semantics should not break anything currently in FreeBSD. Reviewed by: bde Approved by: kib (mentor)
# 191315	20-Apr-2009	kib	In ufs_checkpath(), recheck that '..' still points to the inode with the same inode number after VFS_VGET() and relock of the vp. If '..' changed, redo the lookup. To reduce code duplication, move the code to read '..' dirent into the static helper function ufs_dir_dd_ino(). Supply the source inode number as an argument to ufs_checkpath() instead of the source inode itself. The inode is unlocked, thus it might be reclaimed, causing accesses to the freed memory. Use vn_vget_ino() to get the '..' vnode by its inode number, instead of directly code VFS_VGET() and relock, to properly busy the mount point while vp lock is dropped. Noted and reviewed by: tegge Tested by: pho MFC after: 1 month
# 191249	18-Apr-2009	trasz	Use acl_alloc() and acl_free() instead of using uma(9) directly. This will make switching to malloc(9) easier; also, it would be neccessary to add these routines if/when we implement variable-size ACLs.
# 187564	21-Jan-2009	jhb	Fix a few style bogons. Submitted by: bde
# 187526	21-Jan-2009	jhb	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho
# 186194	16-Dec-2008	trasz	According to phk@, VOP_STRATEGY should never, _ever_, return anything other than 0. Make it so. This fixes "panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648", encountered when writing to an orphaned filesystem. Reason for the panic was the following assert: KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp)); at vfs_bio:bufstrategy(). Reviewed by: scottl, phk Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
# 184413	28-Oct-2008	trasz	Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
# 184408	28-Oct-2008	kib	Provide an explanation for getinoquota() call in the ufs_access vop. MFC after: 3 days
# 184205	23-Oct-2008	des	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months
# 183212	20-Sep-2008	kib	Initialize va_flags and va_filerev properly in VOP_GETATTR(). Don't initialize va_vaflags and va_spare because they are not part of the VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero. Submitted by: Jaakko Heinonen <jh saunalahti fi> Reviewed by: bde Discussed on: freebsd-fs MFC after: 1 month
# 183078	16-Sep-2008	jhb	vdropl() drops the vnode interlock. Thus, the code in the QUOTA case that upgrades the vnode lock if it is share locked was dropping the interlock before actually checking VI_DOOMED. Fix this by do the vdropl() after the check and relying on it to drop the vnode interlock. Reported by: pho Reviewed by: kib MFC after: 1 week
# 183070	16-Sep-2008	kib	When downgrading the read-write mount to read-only, do_unmount() sets MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive() and ufs_itimes_locked(), UFS verifies whether the fs is read-only by checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag for inode on the fs being remounted rw->ro. Introduce UFS_RDONLY() struct ufsmount' method that reports the value of the fs_ronly. The later is set to 1 only after the remount is finished. Reviewed by: tegge In collaboration with: pho MFC after: 1 month
# 182371	28-Aug-2008	attilio	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 182115	24-Aug-2008	kib	Put the relocked variable from the r182111 into the #ifdef QUOTA braces to prevent warning about unused var on the !QUOTA kernels. Reported by: ed MFC after: 1 week
# 182111	24-Aug-2008	kib	Revert the r167541: "Remove unneeded getinoquota() call in the ufs_access()." The call to getinoquota in ufs_access() serves the purpose of instantiating inode dquot from the vn_open(). Since quotas are accounted only for the inodes with already attached dquot, removal of the call prevented opened inodes from participation in the quota calculations. Since ufs_access() may be called with the vnode being only shared locked, upgrade (and then downgrade) vnode lock if calling getinoquota(). Reported by: simon at optinet com In collaboration with: pho MFC after: 1 week
# 178243	16-Apr-2008	kib	Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock. Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode. The implementation of the lf_purgelocks() is submitted by dfr. Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks
# 177633	26-Mar-2008	dfr	Add the new kernel-mode NFS Lock Manager. To use it instead of the user-mode lock manager, build a kernel with the NFSLOCKD option and add '-k' to 'rpc_lockd_flags' in rc.conf. Highlights include: * Thread-safe kernel RPC client - many threads can use the same RPC client handle safely with replies being de-multiplexed at the socket upcall (typically driven directly by the NIC interrupt) and handed off to whichever thread matches the reply. For UDP sockets, many RPC clients can share the same socket. This allows the use of a single privileged UDP port number to talk to an arbitrary number of remote hosts. * Single-threaded kernel RPC server. Adding support for multi-threaded server would be relatively straightforward and would follow approximately the Solaris KPI. A single thread should be sufficient for the NLM since it should rarely block in normal operation. * Kernel mode NLM server supporting cancel requests and granted callbacks. I've tested the NLM server reasonably extensively - it passes both my own tests and the NFS Connectathon locking tests running on Solaris, Mac OS X and Ubuntu Linux. * Userland NLM client supported. While the NLM server doesn't have support for the local NFS client's locking needs, it does have to field async replies and granted callbacks from remote NLMs that the local client has contacted. We relay these replies to the userland rpc.lockd over a local domain RPC socket. * Robust deadlock detection for the local lock manager. In particular it will detect deadlocks caused by a lock request that covers more than one blocking request. As required by the NLM protocol, all deadlock detection happens synchronously - a user is guaranteed that if a lock request isn't rejected immediately, the lock will eventually be granted. The old system allowed for a 'deferred deadlock' condition where a blocked lock request could wake up and find that some other deadlock-causing lock owner had beaten them to the lock. * Since both local and remote locks are managed by the same kernel locking code, local and remote processes can safely use file locks for mutual exclusion. Local processes have no fairness advantage compared to remote processes when contending to lock a region that has just been unlocked - the local lock manager enforces a strict first-come first-served model for both local and remote lockers. Sponsored by: Isilon Systems PR: 95247 107555 115524 116679 MFC after: 2 weeks
# 175294	13-Jan-2008	attilio	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# 175202	09-Jan-2008	attilio	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# 173464	08-Nov-2007	obrien	Turn most ffs 'DIAGNOSTIC's into INVARIANTS.
# 172930	24-Oct-2007	rwatson	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 170587	11-Jun-2007	rwatson	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
# 169898	23-May-2007	pjd	Eliminate VI_LOCK()/VI_UNLOCK() pair from getattr and close code paths. It's hard to measure performance improvement on my test machine, but the change won't degrade performance for sure. I can measure slight improvement for debugging kernel and it can also be a win for machines where atomic operation is more expensive. Reviewed by: kib
# 167541	14-Mar-2007	kib	Remove unneeded getinoquota() call in the ufs_access(). Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
# 167154	01-Mar-2007	pjd	Change: "... try to use VADMIN in preference to VADMIN ..." To: "... try to use VADMIN in preference to VWRITE ..."
# 167152	01-Mar-2007	pjd	Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better describe the privilege. OK'ed by: rwatson
# 167151	01-Mar-2007	pjd	Avoid checking for privileges if there is no need to. Discussed with: rwatson
# 166564	08-Feb-2007	kib	Remove not needed acquision of the mount interlock aroung reading of mnt_kern_flags in ufs_itimes(). Suggested by: ssouhlal Confirmed by: tegge MFC after: 2 weeks
# 166052	16-Jan-2007	mpp	Fix a spelling error. heirarchy -> hierarchy. Obtained from: OpenBSD
# 164033	06-Nov-2006	rwatson	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# 163841	31-Oct-2006	pjd	Add gjournal specific code to the UFS file system: - Add FS_GJOURNAL flag which enables gjournal support on a file system. - Add cg_unrefs field to the cylinder group structure which holds number of unreferenced (orphaned) inodes in the given cylinder group. - Add fs_unrefs field to the super block structure which holds total number of unreferenced (orphaned) inodes. - When file or a directory is orphaned (last reference is removed, but object is still open), increase fs_unrefs and cg_unrefs fields, which is a hint for fsck in which cylinder groups looks for such (orphaned) objects. - When file is last closed, decrease {fs,cg}_unrefs fields. - Add VV_DELETED vnode flag which points at orphaned objects. Sponsored by: home.pl
# 163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 163194	10-Oct-2006	kib	Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem is suspending/suspended. Doing so may result in deadlock. Instead, set the (new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted. Change the locking protocol in order to set the IN_ACCESS and timestamps without upgrading shared vnode lock to exclusive (see comments in the inode.h). Before that, inode was modified while holding only shared lock. Tested by: Peter Holm Reviewed by: tegge, bde Approved by: pjd (mentor) MFC after: 3 weeks
# 162942	02-Oct-2006	tegge	Correct check for when IO_SYNC should be set for filesystem not using softupdates when truncating a directory to zero length. Discussed with: bde
# 161473	20-Aug-2006	pjd	Correct typo in comment.
# 159109	31-May-2006	maxim	o Rearrange and remove incorrect comments. Requested by: bde
# 159102	31-May-2006	maxim	o According to POSIX, the result of ftruncate(2) is unspecified for file types other than VREG, VDIR and shared memory objects. We already handle VREG, VLNK and VDIR cases. Silently ignore truncate requests for all the rest. Adjust comments. PR: kern/98064 Submitted by: bde Security: local DoS Regress. test: regression/fifo/fifo_misc MFC after: 2 weeks
# 156897	19-Mar-2006	tegge	Add kludge to avoid deadlock when unlinking snapshot.
# 151252	12-Oct-2005	dds	Move execve's access time update functionality into a new vfs_mark_atime() function, and use the new function for performing efficient atime updates in mmap(). Reviewed by: bde MFC after: 2 weeks
# 150634	27-Sep-2005	jhb	Use the refcount API to manage the reference count for user credentials rather than using pool mutexes. Tested on: i386, alpha, sparc64
# 149811	05-Sep-2005	csjp	Convert the primary ACL allocator from malloc(9) to using a UMA zone instead. Also introduce an aclinit function which will be used to create the UMA zone for use by file systems at system start up. MFC after: 1 month Discussed with: rwatson
# 147198	09-Jun-2005	ssouhlal	Allow EVFILT_VNODE events to work on every filesystem type, not just UFS by: - Making the pre and post hooks for the VOP functions work even when DEBUG_VFS_LOCKS is not defined. - Moving the KNOTE activations into the corresponding VOP hooks. - Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct mount that permits filesystems to disable the new behavior. - Creating a default VOP_KQFILTER function: vfs_kqfilter() My benchmarks have not revealed any performance degradation. Reviewed by: jeff, bde Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
# 146829	31-May-2005	kensmith	This patch addresses a standards violation issue. The standards say a file's access time should be updated when it gets executed. A while ago the mechanism used to exec was changed to use a more mmap based mechanism and this behavior was broken as a side-effect of that. A new vnode flag is added that gets set when the file gets executed, and the VOP_SETATTR() vnode operation gets called. The underlying filesystem is expected to handle it based on its own semantics, some filesystems don't support access time at all. Those that do should handle it in a way that does not block, does not generate I/O if possible, etc. In particular vn_start_write() has not been called. The UFS code handles it the same way as it would normally handle the access time if a file was read - the IN_ACCESS flag gets set in the inode but no other action happens at this point. The actual time update will happen later during a sync (which handles all the necessary locking). Got me into this: cperciva Discussed with: a lot with bde, a little with kan Showed patches to: phk, jeffr, standards@, arch@ Minor discussion on: arch@
# 146356	18-May-2005	mckusick	Allow removal of empty directories with high link counts. These can occur on a filesystem running with soft updates after a crash and before a background fsck has been run. To prevent discrepancies from arising in a background fsck that may already be running, the directory is removed but its inode is not freed and is left with the residual reference count. When encountered by the background fsck it will be reclaimed.
# 145138	16-Apr-2005	pjd	- Plug memory leak. - Fix two style nits. Found by: Coverity Prevent analysis tool Reviewed by: rwatson MFC after: 1 week
# 143500	13-Mar-2005	jeff	- In ufs_mknod(), hold the lock across the call to vgone() as that is now required. - In ufs_close(), don't do the EAGAIN vrele hack, the top layer now calls vn_start_write before the lock is acquired as it should. Sponsored by: Isilon Systems, Inc.
# 142692	27-Feb-2005	phk	Remove debug printout of major/minor numbers, print name instead.
# 142682	27-Feb-2005	sam	use uiomove return value instead of always returning 0 when doing a readlink of a fast link Noticed by: Coverity Prevent analysis tool Reviewed by: phk
# 141543	08-Feb-2005	cperciva	Add a new sysctl, "security.jail.chflags_allowed", which controls the behaviour of chflags within a jail. If set to 0 (the default), then a jailed root user is treated as an unprivileged user; if set to 1, then a jailed root user is treated the same as an unjailed root user. This is necessary to allow "make installworld" to work inside a jail, since it attempts to manipulate the system immutable flag on certain files. Discussed with: csjp, rwatson MFC after: 2 weeks
# 141521	08-Feb-2005	phk	For snapshots we need all VOP_LOCKs to be exclusive. The "business class upgrade" was implemented in UFS's VOP_LOCK implementation ufs_lock() which is the wrong layer, so move it to ffs_lock(). Also, as long as we have not abandonned advanced vfs-stacking we should not preclude it from happening: instead of implementing a copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at vop_stdlock() at the bottom.
# 141143	02-Feb-2005	kensmith	Back out previous commit, bde@ provided an example of something this breaks.
# 141130	01-Feb-2005	kensmith	It was noticed that we do not change a file's access time when it gets executed. This appears to violate most of the UNIX-ish standards. One example quote from: http://www.opengroup.org/onlinepubs/009695399/functions/exec.html Upon successful completion, the exec functions shall mark for update the st_atime field of the file. If an exec function failed but was able to locate the process image file, whether the st_atime field is marked for update is unspecified. Should the exec function succeed, the process image file shall be considered to have been opened with open(). This appears to take care of it for ufs filesystems, doing the necessary sanity checks (read-only filesystem, etc) without violating any other standards (setting atime for any open appears to be allowed in any standards I could find). Noticed by: cperciva Reviewed by: kan, rwatson
# 140962	29-Jan-2005	peadar	Tell vnode_create_vobject() how big an object to create, rather than having it work it out via the more expensive VOP_GETATTR Reviewed by: phk@
# 140778	24-Jan-2005	phk	Create a vnode object when the file is opened. Trust that we did so.
# 140729	24-Jan-2005	phk	Polish style.
# 140051	11-Jan-2005	phk	Wrap the bufobj operations in macros: BO_STRATEGY() and BO_WRITE()
# 139825	07-Jan-2005	imp	/* -> /*- for license, minor formatting changes
# 138868	14-Dec-2004	phk	Implement simpler panics for VOP_{read,write} on fifos.
# 138700	11-Dec-2004	marcel	Revert previous commit. The null-pointer function call (a dereference on ia64) was not the result of a change in the vector operations. It was caused by the NFS locking code using a FIFO and those bypassing the vnode. This indirectly caused the panic. The NFS locking code has been changed. Requested by: phk
# 138411	05-Dec-2004	marcel	Fix null-pointer indirect function calls introduced in the previous commit. In the new world order, the transitive closure on the vector operations is not precomputed. As such, it's unsafe to actually use any of the function pointers in an indirect function call. They can be null, and we need to use the default vector in that case. This is mostly a quick fix for the four function pointers that are ed explicitly. A more generic or scalable solution is likely to see the light of day. No pathos on: current@
# 138290	01-Dec-2004	phk	Back when VOP_* was introduced, we did not have new-style struct initializations but we did have lofty goals and big ideals. Adjust to more contemporary circumstances and gain type checking. Replace the entire vop_t frobbing thing with properly typed structures. The only casualty is that we can not add a new VOP_ method with a loadable module. History has not given us reason to belive this would ever be feasible in the the first place. Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc. Give coda correct prototypes and function definitions for all vop_()s. Generate a bit more data from the vnode_if.src file: a struct vop_vector and protype typedefs for all vop methods. Add a new vop_bypass() and make vop_default be a pointer to another struct vop_vector. Remove a lot of vfs_init since vop_vector is ready to use from the compiler. Cast various vop_mumble() to void * with uppercase name, for instance VOP_PANIC, VOP_NULL etc. Implement VCALL() by making vdesc_offset the offsetof() the relevant function pointer in vop_vector. This is disgusting but since the code is generated by a script comparatively safe. The alternative for nullfs etc. would be much worse. Fix up all vnode method vectors to remove casts so they become typesafe. (The bulk of this is generated by scripts)
# 138270	01-Dec-2004	phk	Mechanically change prototypes for vnode operations to use the new typedefs.
# 137308	06-Nov-2004	phk	Properly implement a default version of VOP_GETWRITEMOUNT. Remove improper access to vop_stdgetwritemount() which should and will instead rely on the VOP default path.
# 137035	29-Oct-2004	phk	Move UFS from DEVFS backing to GEOM backing. This eliminates a bunch of vnode overhead (approx 1-2 % speed improvement) and gives us more control over the access to the storage device. Access counts on the underlying device are not correctly tracked and therefore it is possible to read-only mount the same disk device multiple times: syv# mount -p /dev/md0 /var ufs rw 2 2 /dev/ad0 /mnt ufs ro 1 1 /dev/ad0 /mnt2 ufs ro 1 1 /dev/ad0 /mnt3 ufs ro 1 1 Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches things in RAM) this is not possible with read-write mounts, and the system will correctly reject this. Details: Add a geom consumer and a bufobj pointer to ufsmount. Eliminate the vnode argument from softdep_disk_prewrite(). Pick the vnode out of bp->b_vp for now. Eventually we should find it through bp->b_bufobj->b_private. In the mountcode, use g_vfs_open() once we have used VOP_ACCESS() to check permissions. When upgrading and downgrading between r/o and r/w do the right thing with GEOM access counts. Remove all the workarounds for not being able to do this with VOP_OPEN(). If we are the root mount, drop the exclusive access count until we upgrade to r/w. This allows fsck of the root filesystem and the MNT_RELOAD to work correctly. Set bo_private to the GEOM consumer on the device bufobj. Change the ffs_ops->strategy function to call g_vfs_strategy() In ufs_strategy() directly call the strategy on the disk bufobj. Same in rawread. In ffs_fsync() we will no longer see VCHR device nodes, so remove code which synced the filesystem mounted on it, in case we came there. I'm not sure this code made sense in the first place since we would have taken the specfs route on such a vnode. Redo the highly bogus readblock() function in the snapshot code to something slightly less bogus: Constructing an uio and using physio was really quite a detour. Instead just fill in a bio and ship it down.
# 136988	27-Oct-2004	phk	Eliminate unnecessary KASSERTS.
# 136980	26-Oct-2004	phk	Replace single case switch() with if().
# 136969	26-Oct-2004	phk	The island council met and voted buf_prewrite() home. Give ffs it's own bufobj->bo_ops vector and create a private strategy routine, (currently misnamed for forwards compatibility), which is just a copy of the generic bufstrategy routine except we call softdep_disk_prewrite() directly instead of through the buf_prewrite() indirection. Teach UFS about the need for softdep_disk_prewrite() and call the function directly in FFS. Remove buf_prewrite() from the default bufstrategy() and from the global bio_ops method vector.
# 135877	28-Sep-2004	phk	Remove support for accessing device nodes in UFS/FFS. Device nodes can still be created and exported with NFS.
# 134899	07-Sep-2004	phk	Create simple function init_va_filerev() for initializing a va_filerev field. Replace three instances of longhaired initialization va_filerev fields. Added XXX comment wondering why we don't use random bits instead of uptime of the system for this purpose.
# 134143	22-Aug-2004	csjp	Currently, if the secure level is low enough, system flags can be manipulated by prison root. In 4.x prison root can not manipulate system flags, regardless of the security level. This behavior should remain consistent to avoid any surprises which could lead to security problems for system administrators which give out privileged access to jails. This commit changes suser_cred's flag argument from SUSER_ALLOWJAIL to 0. This will prevent prison root from being able to manipulate system flags on files. This may be a MFC candidate for RELENG_5. Discussed with: cperciva Reviewed by: rwatson Approved by: bmilekic (mentor) PR: kern/70298
# 133741	15-Aug-2004	jmg	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 132775	28-Jul-2004	kan	Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper inode field based on UFS version. Use DIP ro read values and DIP_SET to modify them throughout FFS code base.
# 132653	26-Jul-2004	cperciva	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
# 127975	07-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and irc message from Robert Watson saying that clause 3 can be removed from those files with an NAI copyright that also have only a University of California copyrights. Approved by: core, rwatson
# 126858	11-Mar-2004	phk	When I was a kid my work table was one cluttered mess an cleaning it up were a rather overwhelming task. I soon learned that if you don't know where you're going to store something, at least try to pile it next to something slightly related in the hope that a pattern emerges. Apply the same principle to the ffs/snapshot/softupdates code which have leaked into specfs: Add yet a buf-quasi-method and call it from the only two places I can see it can make a difference and implement the magic in ffs_softdep.c where it belongs. It's not pretty, but at least it's one less layer violated.
# 126853	11-Mar-2004	phk	Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
# 126170	23-Feb-2004	mckusick	A more accurate test in the new ufs_lock than that in 1.235.
# 126153	23-Feb-2004	mckusick	Change UFS from using vop_stdlock to using its own ufs_lock. In ufs_lock, check for attempts to acquire shared locks on snapshot files and change them to be exclusive locks. This change eliminates deadlocks and machine lockups reported in -current since most read requests started using shared lock requests. Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>
# 121205	18-Oct-2003	phk	DuH! bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in the file)
# 121202	18-Oct-2003	phk	Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY()
# 118411	04-Aug-2003	rwatson	Now that the central POSIX.1e ACL code implements functions to generate the inode mode from a default ACL and creation mask, implement ufs_sync_inode_from_acl() using acl_posix1e_newfilemode(). Since ACL_OVERRIDE_MASK/ACL_PRESERVE_MASK are defined, we no longer need to explicitly pass in a "preserve_mask" field: this is implicit in the use of POSIX.1e semantics. Note: this change contains a semantic bugfix for new file creation: we now intersect the ACL-generated mode and the cmode requested by the user process. This means permissions on newly created file objects will now be more conservative. In the future, we may want to provide alternative semantics (similar to Solaris and Linux) in which the ACL mask overrides the umask, permitting ACLs to broaden the rights beyond the requested umask. PR: 50148 Reported by: Ritz, Bruno <bruno_ritz@gmx.ch> Obtained from: TrustedBSD Project
# 118404	03-Aug-2003	rwatson	In ufs_chmod(), use privilege only when required in the following cases: - Setting sticky bit on non-directory - Setting setgid on a file with a group that isn't in the effective or extended groups of the authorizing credential I.e., test the requirement first, then do the privilege test, rather than doing the privilege test regardless of the need for privilege. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 118131	28-Jul-2003	rwatson	Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the kernel ACL interfaces and system call names. Break out UFS2 and FFS extattr delete and list vnode operations from setextattr and getextattr to deleteextattr and listextattr, which cleans up the implementations, and makes the results more readable, and makes the APIs more clear. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 117221	04-Jul-2003	phk	We just cached the inode pointer, no need to call VTOI() again.
# 116412	15-Jun-2003	phk	Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations to check that the buffer points to the correct vnode.
# 116192	11-Jun-2003	obrien	Use __FBSDID().
# 115526	31-May-2003	phk	Remove unused variable. Found by: FlexeLint
# 111841	03-Mar-2003	njl	Finish cleanup of vprint() which was begun with changing v_tag to a string. Remove extraneous uses of vop_null, instead defering to the default op. Rename vnode type "vfs" to the more descriptive "syncer". Fix formatting for various filesystems that use vop_print.
# 111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
# 109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 108686	04-Jan-2003	phk	Temporarily introduce a new VOP_SPECSTRATEGY operation while I try to sort out disk-io from file-io in the vm/buffer/filesystem space. The intent is to sort VOP_STRATEGY calls into those which operate on "real" vnodes and those which operate on VCHR vnodes. For the latter kind, the call will be changed to VOP_SPECSTRATEGY, possibly conditionally for those places where dual-use happens. Add a default VOP_SPECSTRATEGY method which will call the normal VOP_STRATEGY. First time it is called it will print debugging information. This will only happen if a normal vnode is passed to VOP_SPECSTRATEGY by mistake. Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY does on a VCHR vnode today. Add a new VOP_STRATEGY method in specfs to catch instances where the conversion to VOP_SPECSTRATEGY has not yet happened. Handle the request just like we always did, but first time called print debugging information. Apart up to two instances of console messages per boot, this amounts to a glorified no-op commit. If you get any of the messages on your console I would very much like a copy of them mailed to phk@freebsd.org
# 108648	04-Jan-2003	phk	Since Jeffr made the std* functions the default in rev 1.63 of kern/vfs_defaults.c it is wrong for the individual filesystems to use the std* functions as that prevents override of the default. Found by: src/tools/tools/vop_table
# 106058	27-Oct-2002	wollman	Implement the new 1003.1-2001 pathconf() keys, including the Advisory Information option. Other filesystem implementations should do something similar. With advice from: mckusick, phk
# 105988	26-Oct-2002	rwatson	Slightly change the semantics of vnode labels for MAC: rather than "refreshing" the label on the vnode before use, just get the label right from inception. For single-label file systems, set the label in the generic VFS getnewvnode() code; for multi-label file systems, leave the labeling up to the file system. With UFS1/2, this means reading the extended attribute during vfs_vget() as the inode is pulled off disk, rather than hitting the extended attributes frequently during operations later, improving performance. This also corrects sematics for shared vnode locks, which were not previously present in the system. This chances the cache coherrency properties WRT out-of-band access to label data, but in an acceptable form. With UFS1, there is a small race condition during automatic extended attribute start -- this is not present with UFS2, and occurs because EAs aren't available at vnode inception. We'll introduce a work around for this shortly. Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 105572	20-Oct-2002	rwatson	Rename _POSIX_FOO_PRESENT and friends from POSIX.1e to _PC_FOO_PRESENT and related friends. This would have been corrected had POSIX.1e progressed to a standard. Pointed out by: wollman
# 105571	20-Oct-2002	rwatson	Implement _POSIX_ACL_PATH_MAX, which returns the maximum number of ACL entries for a file system node using pathconf(). Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 105567	20-Oct-2002	rwatson	Teach UFS to respond to pathconf() tests for _POSIX_ACL_EXTENDED and _POSIX_MAC_PRESENT based on available mount flags, if the services are available. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 105415	18-Oct-2002	rwatson	Use 'size_t' instead of 'int' for the result of sizeof().
# 105179	15-Oct-2002	rwatson	Push most UFS ACL behavior behind a check for MNT_ACLS, permitting ACLs to be administratively disabled as needed on UFS/UFS2 file systems. This also has the effect of preventing the slightly more expensive ACL code from running on non-ACL file systems, avoiding storage allocation for ACLs that may be read from disk. MNT_ACLS may be set at mount-time using mount -o acls, or implicitly by setting the FS_ACLS flag using tunefs. On UFS1, you may also have to configure ACL store. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 105136	14-Oct-2002	mckusick	When reading or writing the extended attributes of a special device or fifo in UFS2, the normal ufs_strategy routine needs to be used rather than the spec_strategy or fifo_strategy routine. Thus the ffsext_strategy routine is interposed in the ffs_vnops vectors for special devices and fifo's to pick off this special case. Otherwise it simply falls through to the usual spec_strategy or fifo_strategy routine. Submitted by: Robert Watson <rwatson@FreeBSD.org> Sponsored by: DARPA & NAI Labs.
# 105123	14-Oct-2002	rwatson	Fix two memory leaks in error conditions involving the UFS ACL code: if failures occur, make sure that we release both the default ACL and access ACL storage during new object creation. Spotted by: phk and his pet flexelint Sponsored by: DARPA, Network Associates Laboratories
# 104908	11-Oct-2002	mike	Change iov_base's type from `char ' to the standard `void '. All uses of iov_base which assume its type is `char ' (in order to do pointer arithmetic) have been updated to cast iov_base to `char '.
# 104094	28-Sep-2002	phk	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 103944	25-Sep-2002	jeff	- Lock accesses to v_usecount. - Convert interlock locks to use standard macros.
# 103636	19-Sep-2002	truckman	VOP_FSYNC() requires that it's vnode argument be locked, which nfs_link() wasn't doing. Rather than just lock and unlock the vnode around the call to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode in kern_link() before calling VOP_LINK(), since the other filesystems also locked the file vnode right away in their link methods. Remove the locking and and unlocking from the leaf filesystem link methods. Reviewed by: rwatson, bde (except for the unionfs_link() changes)
# 103559	18-Sep-2002	njl	Remove any VOP_PRINT that redundantly prints the tag. Move lockmgr_printinfo() into vprint() for everyone's benefit. Suggested by: bde
# 103314	14-Sep-2002	njl	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
# 103180	10-Sep-2002	bde	vfs_syscalls.c: Changed rename(2) to follow the letter of the POSIX spec. POSIX requires rename() to have no effect if its args "resolve to the same existing file". I think "file" can only reasonably be read as referring to the inode, although the rationale and "resolve" seem to say that sameness is at the level of (resolved) directory entries. ext2fs_vnops.c, ufs_vnops.c: Replaced code that gave the historical BSD behaviour of removing one link name by checks that this code is now unreachable. This fixes some races. All vnodes needed to be unlocked for the removal, and locking at another level using something like IN_RENAME was not even attempted, so it was possible for rename(x, y) to return with both x and y removed even without any unlink(2) syscalls (one process can remove x using rename(x, y) and another process can remove y using rename(y, x)). Prodded by: alfred MFC after: 8 weeks PR: 42617
# 101941	15-Aug-2002	rwatson	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to cliarfy which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101744	12-Aug-2002	rwatson	Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent enforcement of MAC policy on the read or write operations: - In ext2fs, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), directory modifications in rename(), directory write operations in mkdir(), symlink write operations in symlink(). - In the NFS client locking code, perform vn_rdwr() on the NFS locking socket without enforcing MAC, since the write is done on behalf of the kernel NFS implementation rather than the user process. - In UFS, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), and symlink write operations in symlink(). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101308	04-Aug-2002	jeff	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# 101073	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument UFS to support per-inode MAC labels. In particular, invoke MAC framework entry points for generically supporting the backing of MAC labels into extended attributes. This ends up introducing new vnode operation vector entries point at the MAC framework entry points, as well as some explicit entry point invocations for file and directory creation events so that the MAC framework can push labels to disk before the directory names become persistent (this will work better once EAs in UFS2 are hooked into soft updates). The generic EA MAC entry points support executing with the file system in either single label or multilabel operation, and will fall back to the mount label if multilabel is not specified at mount-time. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 100344	19-Jul-2002	mckusick	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags \|= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, lets call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended atribute storage. Sponsored by: DARPA & NAI Labs.
# 100207	17-Jul-2002	mckusick	Change utimes to set the file creation time (for filesystems that support creation times such as UFS2) to the value of the modification time if the value of the modification time is older than the current creation time. See utimes(2) for further details. Sponsored by: DARPA & NAI Labs.
# 100201	16-Jul-2002	mckusick	Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime. Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.
# 98658	23-Jun-2002	dillon	Rename the BALLOC flags from B_* to BA_* to avoid confusion with the struct buf B_ flags. Approved by: mckusick
# 98542	21-Jun-2002	mckusick	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
# 96873	18-May-2002	iedowse	Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u unions, since these were only necessary when ext2fs used ufs code. Reviewed by: mckusick
# 96820	17-May-2002	phk	Call ufs_bmaparray() with right parameter type. Sponsored by: DARPA & NAI Labs.
# 96755	16-May-2002	trhodes	More s/file system/filesystem/g
# 96506	13-May-2002	phk	Remove register keyword. Sponsored by: DARPA & NAI Labs. Submitted by: mckusick
# 95974	03-May-2002	phk	Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and put then in the ufs_vnops where they belong, rather than in the ffs_vnops. Ok'ed by: rwatson Sponsored by: DARPA & NAI Labs.
# 95945	02-May-2002	phk	Use vop_panic() instead of our home-rolled version.
# 93593	01-Apr-2002	jhb	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 92728	19-Mar-2002	alfred	Remove __P.
# 92462	16-Mar-2002	mckusick	Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems. only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.
# 92363	15-Mar-2002	mckusick	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
# 90860	18-Feb-2002	phk	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>
# 90791	17-Feb-2002	phk	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.
# 90790	17-Feb-2002	phk	Collect the VN_KNOTE() macro definitions on vnode.h
# 89384	15-Jan-2002	mckusick	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>
# 86782	22-Nov-2001	guido	When mkdir()-ing, the parent dir gets is linkcount increased. Fix VN_KNOTE to reflect that. Found by: tobez@freebsd.org MFC after: 2 days
# 84642	07-Oct-2001	dillon	Remove panics for rename() race conditions. The panics are inappropriate because the IN_RENAME flag only fixes a few of the huge number of race conditions that can result in the source path becoming invalid even prior to the VOP_RENAME() call. The panics created a serious security issue whereby an attacker could fairly easily cause the panic to occur, crashing the machine. The correct solution requires a great deal of work in the namei path cache code. MFC after: 0 days
# 84344	02-Oct-2001	dillon	Backout the last commit. The problem is actually much worse then I first thought and may require serious work to the VOP_RENAME() api itself. Basically, by the time the VOP_RENAME() function is called, it's already too late.
# 84339	02-Oct-2001	dillon	IN_RENAME should only be cleared by the routine that set it. This fixes a rename/rmdir race that has been shown to cause a panic. Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com> MFC after: 3 days
# 83992	26-Sep-2001	rwatson	o Re-enable support of system file flags in jail() by adding back the PRISON_ROOT to the suser_xxx() check. Since securelevels may now be raised in specific jails, use of system flags can still be restricted in jail(), but in a more configurable way. o Users of jail() expecting system flags (such as schg) to restrict jail()'s should be sure to set the securelevel appropriately in jail()'s. o This fixes activities involving automated system flag removal in jail(), including installkernel and friends. Obtained from: TrustedBSD Project
# 83987	26-Sep-2001	rwatson	o Modify ufs_setattr() so that it uses securelevel_gt() instead of direct variable access. Obtained from: TrustedBSD Project
# 83924	25-Sep-2001	rwatson	o Further clarify comment: ad Udo's request, re-insert the 'if' refering to securelevels; also, update the unprivileged process text to better indicate the scope of actions permittable when any system flags are already set (limited). Submitted by: Udo Schweigert <udo.schweigert@siemens.com>
# 83918	25-Sep-2001	rwatson	o Parallelize the comment on the relationship between privileged un-jailed processes and the actual securelevel check: make the comment use '> 0' instead of inverted '<= 0'.
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 82395	27-Aug-2001	peter	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.
# 82364	26-Aug-2001	iedowse	Stop using dirhash when a directory is removed, and ensure that we never attempt to hash directories once they are deleted. This fixes a problem where operations on a deleted directory could trigger dirhash sanity panics.
# 80554	29-Jul-2001	iedowse	Two recent commits in sys/ufs/ufs interacted badly with ext2fs because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink is invalid because ext2fs does not maintain this field. In ufs_close(), i_effnlink is also tested, to determines whether or not to call vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of ext2fs filesystems; I believe the other is harmless. Fix both cases by checking um_i_effnlink_valid in the ufsmount struct, and use i_nlink if necessary. Noticed by: bde Reviewed by: mckusick, bde
# 77822	06-Jun-2001	jlemon	Add a wrapper for the fifo kqfilter which falls through to the ufs routine. This permits the fifo to inherit the ufs VNODE kqfilter.
# 77762	05-Jun-2001	jlemon	Add a kqueue filter for writing to ufs filesystems which always returns true. This permits better interoperability with programs which register filters on their stdin/stdout handles. Submitted by: Niels Provos <provos@citi.umich.edu>
# 77031	23-May-2001	ru	- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file systems were repo-copied from sys/miscfs to sys/fs. - Renamed the following file systems and their modules: fdesc -> fdescfs, portal -> portalfs, union -> unionfs. - Renamed corresponding kernel options: FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS. - Install header files for the above file systems. - Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland Makefiles.
# 76174	01-May-2001	phk	Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
# 76132	29-Apr-2001	phk	VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".
# 76117	29-Apr-2001	grog	Revert consequences of changes to mount.h, part 2. Requested by: bde
# 75943	25-Apr-2001	mckusick	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.
# 75858	23-Apr-2001	grog	Correct #includes to work with fixed sys/mount.h.
# 75571	17-Apr-2001	rwatson	In my first reading of POSIX.1e, I misinterpreted handling of the ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the access ACL could be used by privileged processes to change file/directory ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and ACL_OTHER) should have undefined ae_id fields; this commit attempts to correct that misunderstanding. o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid associated with the vnode, as those can no longer be extracted from the ACL passed as an argument. Perform all comparisons against the passed arguments. This actually has the effect of simplifying a number of components of this call, as well as reducing the indent level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP. o Modify acl_posix1e_check() to return EINVAL if the ae_id field of any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value other than ACL_UNDEFINED_ID. As a temporary work-around to allow clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before each check so that this cannot cause a failure in the short term (this work-around will be removed when the userland libraries and utilities are updated to take this change into account). o Modify ufs_sync_acl_from_inode() so that it forces ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID when synchronizing the ACL from the inode. o Modify ufs_sync_inode_from_acl to not propagate uid and gid information to the inode from the ACL during ACL update. Also modify the masking of permission bits that may be set from ALLPERMS to (S_IRWXU\|S_IRWXG\|S_IRWXO), as ACLs currently do not carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT). o Modify ufs_getacl() so that when it emulates an access ACL from the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID. o Clean up ufs_setacl() substantially since it is no longer possible to perform chown/chgrp operations using vop_setacl(), so all the access control for that can be eliminated. o Modify ufs_access() so that it passes owner uid and gid information into vaccess_acl_posix1e(). Pointed out by: jedger Obtained from: TrustedBSD Project
# 74822	26-Mar-2001	rwatson	Introduce support for POSIX.1e ACLs on UFS-based file systems. This implementation is still experimental, and while fairly broadly tested, is not yet intended for production use. Support for POSIX.1e ACLs on UFS will not be MFC'd to RELENG_4. This implementation works by providing implementations of VOP_[GS]ETACL() for FFS, as well as modifying the appropriate access control and file creation routines. In this implementation, ACLs are backed into extended attributes; the base ACL (owner, group, other) permissions remain in the inode for performance and compatibility reasons, so only the extended and default ACLs are placed in extended attributes. The logic for ACL evaluation is provided by the fs-independent kern/kern_acl.c. o Introduce UFS_ACL, a compile-time configuration option that enables support for ACLs on FFS (and potentially other UFS-based file systems). o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which respectively get, set, and check the ACLs on the passed vnode. o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to maintain access control information between inode permissions and extended attribute data. o Modify ufs_access() to load a file access ACL and invoke vaccess_acl_posix1e() if ACLs are available on the file system o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly created directories and files, inheriting from the parent directory's default ACL. o Enable these new vnode operations and conditionally compiled code paths if UFS_ACL is defined. A few notes: o This implementation is fairly widely tested, but still should be considered experimental. o Currently, ACLs are not exported via NFS, instead, the summarizing file mode/etc from the inode is. This results in conservative protection behavior, similar to the behavior of ACL-nonaware programs acting locally. o It is possible that underlying binary data formats associated with this implementation may change. Consumers of the implementation should expect to find their local configuration obsoleted in the next few months, resulting in possible loss of ACL data during an upgrade. o The extended attributes interface and implementation is still undergoing modification to address portable interface concerns, as well as performance. o Many applications do not yet correctly handle ACLs. In general, due to the POSIX.1e ACL model, behavior of ACL-unaware applications will be conservative with respects to file protection; some caution is recommended. o Instructions for configuring and maintaining ACLs on UFS will be committed in the near future; in the mean time it is possible to reference the README included in the last UFS ACL distribution placed in the TrustedBSD web site: http://www.TrustedBSD.org/downloads/ Substantial debugging, hardware, travel, or connectivity support for this project was provided by: BSDi, Safeport Network Services, and NAI Labs. Significant coding contributions were made by Chris Faulhaber. Additional support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin. Reviewed by: jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs Obtained from: TrustedBSD Project
# 74256	14-Mar-2001	rwatson	o In my merge, missed the one-line patch to ufs_vnops.c that removed the static prototype for ufs_readdir(). Note that ufs_readdir() was actually already non-static, the prototype was incorrect. Submitted by: jedgar
# 72956	23-Feb-2001	jlemon	Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean(). Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206
# 72953	23-Feb-2001	jlemon	Use correct list pointer when detaching knote from list.
# 72521	15-Feb-2001	jlemon	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
# 72200	09-Feb-2001	bmilekic	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 68307	04-Nov-2000	bde	Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of ufs_vnops.c: 1) i_ino was confused with i_number, so the inode number passed to VFS_VGET() was usually wrong (usually 0U). 2) ip was dereferenced after vgone() freed it, so the inode number passed to VFS_VGET() was sometimes not even wrong. Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the inode number, so inode number 0U gives a way out of bounds array index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba() doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?) from the disk for the invalid inode number 0U; ufs_mknod() returns a wrong vnode, but most callers just vput() it; the correct vnode is eventually obtained by an implicit VFS_VGET() just like it used to be. Bug (2) usually doesn't happen.
# 68186	01-Nov-2000	eivind	Give vop_mmap an untimely death. The opportunity to give it a timely death timed out in 1996.
# 67893	29-Oct-2000	phk	Move suser() and suser_xxx() prototypes and a related #define from <sys/proc.h> to <sys/systm.h>. Correctly document the #includes needed in the manpage. Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.
# 67309	19-Oct-2000	rwatson	o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project
# 66615	03-Oct-2000	jasone	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
# 66355	25-Sep-2000	bp	Add a lock structure to vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or keept as a first element of structure referenced by v_data pointer(ffs). Such organization leads to known problems with stacked filesystems. From this point vop_nolock() functions maintain only interlock lock. vop_stdlock() functions maintain built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on vnode. If filesystem wishes to export lockmgr compatible lock, it can put an address of this lock to v_vnlock field. This indicates that the upper filesystem can take advantage of it and use single lock structure for entire (or part) of stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes. Reviewed in general by: mckusick
# 66040	18-Sep-2000	rwatson	o Allow privileged processes in jail() to override sticky bit behavior on directories. o Allow privileged processes in jail() to create inodes with the setgid bit set even if they are not a member of the group denoted by the file creation gid. This occurs due to inherited gid's from parent directories on file creation, allowing a user to create a file with a gid that is not in the creating process's credentials. Obtained from: TrustedBSD Project
# 66039	18-Sep-2000	rwatson	o Add a comment clarifying interaction between jail(), privileged processes, and UFS file flags. Here's what the comment says, for reference: Privileged processes in jail() are permitted to modify arbitrary user flags on files, but are not permitted to modify system flags. In other words, privilege does allow a process in jail to modify user flags for objects that the process does not own, but privilege will not permit the setting of system flags on the file. Obtained from: TrustedBSD Project
# 66038	18-Sep-2000	rwatson	o Add missing PRISON_ROOT allowing a privileged process in a jail() to not remove the setuid/setgid bits by virtue of a change to a file with those bits set, even if the process doesn't own the file, or isn't a group member of the file's gid. Obtained from: TrustedBSD Project
# 66033	18-Sep-2000	rwatson	o Substitute suser() calls for direct credential checks, which is now safe as suser() no longer sets ASU. o Note that in some cases, the PRISON_ROOT flag is used even though no process structure is passed, to indicate that if a process structure (and hence jail) was available, it would be ok. In the long run, the jail identifier should probably be moved to ucred, as the uidinfo information was. o Some uid 0 checks remain relating to the quota code, which I'll leave for another day. Reviewed by: phk, eivind Obtained from: TrustedBSD Project
# 65928	16-Sep-2000	phk	Remove a pointless casting of a gid_t to a gid_t.
# 65200	29-Aug-2000	rwatson	o Restructure vaccess() so as to check for DAC permission to modify the object before falling back on privilege. Make vaccess() accept an additional optional argument, privused, to determine whether privilege was required for vaccess() to return 0. Add commented out capability checks for reference. Rename some variables to make it more clear which modes/uids/etc are associated with the object, and which with the access mode. o Update file system use of vaccess() to pass NULL as the optional privused argument. Once additional patches are applied, suser() will no longer set ASU, so privused will permit passing of privilege information up the stack to the caller. Reviewed by: bde, green, phk, -security, others Obtained from: TrustedBSD Project
# 64865	20-Aug-2000	phk	Centralize the canonical vop_access user/group/other check in vaccess(). Discussed with: bde
# 63897	26-Jul-2000	mckusick	Clean up the snapshot code so that it no longer depends on the use of the SF_IMMUTABLE flag to prevent writing. Instead put in explicit checking for the SF_SNAPSHOT flag in the appropriate places. With this change, it is now possible to rename and link to snapshot files. It is also possible to set or clear any of the owner, group, or other read bits on the file, though none of the write or execute bits can be set. There is also an explicit test to prevent the setting or clearing of the SF_SNAPSHOT flag via chflags() or fchflags(). Note also that the modify time cannot be changed as it needs to accurately reflect the time that the snapshot was taken. Submitted by: Robert Watson <rwatson@FreeBSD.org>
# 63788	24-Jul-2000	mckusick	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occationally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
# 62976	11-Jul-2000	mckusick	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 60041	05-May-2000	phk	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# 59794	30-Apr-2000	phk	Remove unneeded #include <vm/vm_zone.h> Generated by: src/tools/tools/kerninclude
# 59289	16-Apr-2000	jlemon	Replace the POLLEXTEND extensions with the kqueue() mechanism.
# 59249	15-Apr-2000	phk	Complete the bio/buf divorce for all code below devfs::strategy Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case. CCD not converted yet, casts to struct buf (still safe) atapi-cd casts to struct buf to examine B_PHYS
# 59241	15-Apr-2000	rwatson	Introduce extended attribute support for FFS, allowing arbitrary (name, value) pairs to be associated with inodes. This support is used for ACLs, MAC labels, and Capabilities in the TrustedBSD security extensions, which are currently under development. In this implementation, attributes are backed to data vnodes in the style of the quota support in FFS. Support for FFS extended attributes may be enabled using the FFS_EXTATTR kernel option (disabled by default). Userland utilities and man pages will be committed in the next batch. VFS interfaces and man pages have been in the repo since 4.0-RELEASE and are unchanged. o ufs/ufs/extattr.h: UFS-specific extattr defines o ufs/ufs/ufs_extattr.c: bulk of support routines o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes o contrib/softupdates/ffs_softdep.c: extattr.h includes o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h (This should not be the case, and will be fixed in a future commit) Currently attributes are not supported in MFS. This will be fixed. Reviewed by: adrian, bp, freebsd-fs, other unthanked souls Obtained from: TrustedBSD Project
# 58934	02-Apr-2000	phk	Move B_ERROR flag to b_ioflags and call it BIO_ERROR. (Much of this done by script) Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED. Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack. Add bio_queue field for struct bio aware disksort. Address a lot of stylistic issues brought up by bde.
# 58349	20-Mar-2000	phk	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.
# 57387	22-Feb-2000	rwatson	After much consulting with bde, concluded that this fix was the best fix to the current jail/chflags interactions. This fix conditionalizes ``root behavior'' in the chflags() case on not being in jail, so attempts to perform a chflags in a jail are limited to what a normal user could do. For example, this does allow setting of user flags as appropriate, but prohibits changing of system flags. Reviewed by: bde
# 57347	19-Feb-2000	rwatson	Disable chflags() from within jail() so that root within jail can't make a mess in securelevel environments. Results in one warning during /etc/rc as it attempts to remove file flags, but this is harmless. Approved by: High Lord Hubbard
# 55697	09-Jan-2000	mckusick	Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
# 54655	15-Dec-1999	eivind	Introduce NDFREE (and remove VOP_ABORTOP)
# 53131	13-Nov-1999	eivind	Remove WILLRELE from VOP_SYMLINK Note: Previous commit to these files (except coda_vnops and devfs_vnops) that claimed to remove WILLRELE from VOP_RENAME actually removed it from VOP_MKNOD.
# 53101	12-Nov-1999	eivind	Remove WILLRELE from VOP_RENAME
# 52838	03-Nov-1999	bde	Quick fix for breakage of ext2fs link counts as reported by stat(2) by the soft updates changes: only report the link count to be i_effnlink in ufs_getattr() for file systems that maintain i_effnlink. Tested by: Mike Dracopoulos <mdraco@math.uoa.gr>
# 50521	28-Aug-1999	phk	remove unused variables.
# 50477	27-Aug-1999	peter	$Id$ -> $FreeBSD$
# 50405	26-Aug-1999	phk	Simplify the handling of VCHR and VBLK vnodes using the new dev_t: Make the alias list a SLIST. Drop the "fast recycling" optimization of vnodes (including the returning of a prexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore. Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively. Make the revoke syscalls use vcount() instead of VALIASED. Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag. vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one. Print the devicename in specfs/vprint(). Remove a couple of stale LFS vnode flags. Remove unimplemented/unused LK_DRAINED;
# 50253	23-Aug-1999	bde	Use devtoname() to print dev_t's instead of casting them to long or u_long for misprinting in %lx format.
# 50137	21-Aug-1999	jdp	Support full-precision file timestamps. Until now, only the seconds have been maintained, and that is still the default. A new sysctl variable "vfs.timestamp_precision" can be used to enable higher levels of precision: 0 = seconds only; nanoseconds zeroed (default). 1 = seconds and nanoseconds, accurate within 1/HZ. 2 = seconds and nanoseconds, truncated to microseconds. >=3 = seconds and nanoseconds, maximum precision. Level 1 uses getnanotime(), which is fast but can be wrong by up to 1/HZ. Level 2 uses microtime(). It might be desirable for consistency with utimes() and friends, which take timeval structures rather than timespecs. Level 3 uses nanotime() for the higest precision. I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with "cpio -pdu". There was almost negligible difference in the system times -- much less than 1%, and less than the variation among multiple runs at the same level. Bruce Evans dreamed up a torture test involving 1-byte reads with intervening fstat() calls, but the cpio test seems more realistic to me. This feature is currently implemented only for the UFS (FFS and MFS) filesystems. But I think it should be easy to support it in the others as well. An earlier version of this was reviewed by Bruce. He's not to blame for any breakage I've introduced since then. Reviewed by: bde (an earlier version of the code)
# 49682	13-Aug-1999	phk	Move the special-casing of stat(2)->st_blksize for device files from UFS to the generic level. For chr/blk devices we don't care about the blocksize of the filesystem, we want what the device asked for.
# 49678	13-Aug-1999	phk	s/v_specinfo/v_rdev/
# 49535	08-Aug-1999	phk	Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
# 48801	13-Jul-1999	mckusick	Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has been set for a mount point. Insert missing checks to ensure that all write operations are done asynchronously when the MNT_ASYNC option has been requested. Submitted by: Craig A Soules <soules+@andrew.cmu.edu> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 47964	16-Jun-1999	mckusick	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
# 47028	11-May-1999	phk	Divorce "dev_t" from the "major\|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland. Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev() For now they're functions, they will become in-line functions after one of the next two steps in this process. Return major/minor/makedev to macro-hood for userland. Register a name in cdevsw[] for the "filedescriptor" driver. In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to a actual device. In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few houskeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang). A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that. Without DEVT_FASCIST I belive this patch is a no-op. Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result. Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
# 46155	28-Apr-1999	phk	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no privisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
# 46112	27-Apr-1999	phk	Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc ), prototyped in <sys/proc.h>. 3: s/suser_xxx($[a-zA-Z0-9_]$->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.
# 44395	02-Mar-1999	imp	Merge patch to ufs_vnops.c's ufs_rename to the copy of ufs_rename that lives in ext2_vnops.c for ext2fs. Also remove cast from comparision. Bruce pointed out that it was bogus since we'd force a signed comparision when we really wanted an unsigned comparison.
# 44291	26-Feb-1999	imp	Fix last commit based on feedback from Guido, Bruce and Terry. Specifically, the test was in the wrong place, lacked a cast, didn't unlock the node, and exited to bad rather than abortit. Now we don't allow renaming of a file with LINK_MAX references. Move the test to earlier in the code as it is closer to where ip is obtained, as that is the style of the rest of the function. Didn't fix the problems bruce pointed out in the rename man page to include EMLINK, nor address his complaints about how the whole idea of incrementing the link count during a rename is potentially asking for trouble. Also didn't try to correct potential problem Terry pointed out with decrements not being similarly protected against underflow.
# 44253	25-Feb-1999	imp	Add missing check for LINK_MAX in ufs_rename. Since ip->i_effnlink and ip->nlink were different types, there was a masked overflow. Reported by: Mark Slemco <marcs@znep.com>
# 44248	25-Feb-1999	dillon	Update ufs_vnops code to use new specinfo fields rather then guess. This is part of general specinfo / d_parms() commit.
# 43958	13-Feb-1999	dillon	Remove XXX comment in regarsd to why NFS doesn't use VOP_ABORT(). NFS is being fixed now.
# 43311	27-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 42957	21-Jan-1999	dillon	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# 42374	07-Jan-1999	bde	Don't pass unused unused timestamp args to UFS_UPDATE() or waste time initializing them. This almost finishes centralizing (in-core) timestamp updates in ufs_itimes().
# 42042	24-Dec-1998	bde	Fixed null pointer panics which I introduced in rev.1.86. Vnodes may be revoked, so vnop routines must be careful about accessing the vnode if they may have blocked. Fixed marking for update after successfully reading or writing 0 bytes. In this case, POSIX.1 specifies marking if and only if the requested count is nonzero, but rev.1.86 never marked.
# 41954	20-Dec-1998	dfr	In ufs_setattr(), if only one of va_atime or va_mtime are != VNOVAL, then the code set the other field in the inode to VNOVAL. This can happen sometimes on an NFS server.
# 41610	09-Dec-1998	eivind	Make compare correct with unsigned types. (Problem introduced by Lite/2).
# 39796	29-Sep-1998	mckusick	Do not allow a mounted on directory to be rmdir'ed. This removal can happen when an NFS exported filesystem tries to remove a locally mounted on directory. PR: kern/7272 Submitted by: Andre Albsmeier <andre.albsmeier@mchp.siemens.de>
# 38292	12-Aug-1998	msmith	"The releaseing of the reference and lock is not temporary and belongs where it is. The reference and lock(s) are acquired just above the code in VREF() and relookup()." Submitted by: Michael Hancock <michaelh@cet.co.jp>
# 38291	12-Aug-1998	julian	Handle the case of moving a directory onto the top of a sibling's child of the same name. Submitted by: Kirk Mckusick with fixes from luoqi Chen Obtained from: Whistle test tree.
# 37887	27-Jul-1998	bde	Made lazy syncing of timestamps for special files non-optional.
# 37555	11-Jul-1998	bde	Fixed printf format errors.
# 37539	09-Jul-1998	julian	Add code missed in the initial Soft updates integration. Make the unallocated parts of a directry have a know state in case we need it later.
# 37384	04-Jul-1998	julian	VOP_STRATEGY grows an (struct vnode *) argument as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
# 37364	03-Jul-1998	bde	Restored revs.1.89-1.90 which I somehow clobbered in rev.1.91.
# 37363	03-Jul-1998	bde	Sync timestamp changes for inodes of special files to disk as late as possible (when the inode is reclaimed). Temporarily only do this if option UFS_LAZYMOD configured and softupdates aren't enabled. UFS_LAZYMOD is intentionally left out of /sys/conf/options. This is mainly to avoid almost useless disk i/o on battery powered machines. It's silly to write to disk (on the next sync or when the inode becomes inactive) just because someone hit a key or something wrote to the screen or /dev/null. PR: 5577 Previous version reviewed by: phk
# 37362	03-Jul-1998	bde	Centralized in-core inode update. Update the in-core inode directly in ufs_setattr() so that there is no need to pass timestamps to UFS_UPDATE() (everything else just needs the current time). Ignore the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes() (was: itimes()) to do the update. The timestamps are still passed so that all the callers don't need to be changed yet.
# 37182	27-Jun-1998	phk	Make vprint() print dev_t in hex also.
# 37181	27-Jun-1998	phk	Report the type from the inode, not the vnode.
# 36779	08-Jun-1998	julian	The version of the softdep changes in FreeBSD broke the (doingdirectory && !newparent) case of ufs_rename(). rename("D1/X/", "D2/Y/") gives a wrong link count for D2. Submitted by: Bruce Evans <bde@zeta.org.au> Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
# 36723	07-Jun-1998	bde	Null change. Forgot to mention in previous log message that MNT_NOATIME is now ignored for special files, so that mounting root with option noatime doesn't break reporting of idle times in programs like `w'. The problem of execessive disk updates just to stamp atimes will be handled for special files by only writing atimes to disk when inodes become active. This works well because special files are relatively uncommon and their atimes are even more disposable at panic time than regular files' atimes.
# 36721	07-Jun-1998	bde	Fixed some longstanding timestamp bugs: 1. mark atimes and mtimes of special files and fifos for update upon successful completion of non-null i/o, not at the beginning of the syscall. 2. never update file times for readonly filesystems. They were updated for stats and closes but not for syncs. The updates were of course only in-core and were thrown away when the inode was uncached, so the times sometimes appeared to go backwards. Improved comments in code related to (1) (mostly by removing them). Unmacroized ITIMES(). The test in (2) bloated it even more. Don't call getmicrotime() in the function version of it when we only need the time in seconds.
# 36119	17-May-1998	phk	s/nanoruntime/nanouptime/g s/microruntime/microuptime/g Reviewed by: bde
# 35823	07-May-1998	msmith	In the words of the submitter: --------- Make callers of namei() responsible for releasing references or locks instead of having the underlying filesystems do it. This eliminates redundancy in all terminal filesystems and makes it possible for stacked transport layers such as umapfs or nullfs to operate correctly. Quality testing was done with testvn, and lat_fs from the lmbench suite. Some NFS client testing courtesy of Patrik Kudo. vop_mknod and vop_symlink still release the returned vpp. vop_rename still releases 4 vnode arguments before it returns. These remaining cases will be corrected in the next set of patches. --------- Submitted by: Michael Hancock <michaelh@cet.co.jp>
# 35256	17-Apr-1998	des	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
# 35029	04-Apr-1998	phk	Time changes mark 2: * Figure out UTC relative to boottime. Four new functions provide time relative to boottime. * move "runtime" into struct proc. This helps fix the calcru() problem in SMP. * kill mono_time. * add timespec{add\|sub\|cmp} macros to time.h. (XXX: These may change!) * nanosleep, select & poll takes long sleeps one day at a time Reviewed by: bde Tested by: ache and others
# 34961	30-Mar-1998	phk	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't a atomic variable, so splfoo() protection were needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now uses the new variable time_second instead. gettime() changed to getmicrotime(0. Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeard time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passwd &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# 34901	26-Mar-1998	phk	Add two new functions, get{micro\|nano}time. They are atomic, but return in essence what is in the "time" variable. gettime() is now a macro front for getmicrotime(). Various patches to use the two new functions instead of the various hacks used in their absence. Some puntuation and grammer patches from Bruce. A couple of XXX comments.
# 34266	08-Mar-1998	julian	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
# 33181	09-Feb-1998	eivind	Staticize.
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 32944	31-Jan-1998	julian	Serves me right for not puting SUIDDIR in LINT. it got bitrot. This should stop complaints about it not working for people.
# 32011	27-Dec-1997	bde	Unspammed nested include of <vm/vm_zone.h>.
# 31749	15-Dec-1997	eivind	Convert SUIDDIR fully to a new-style option. Forgotten by: julian
# 31727	15-Dec-1997	wollman	Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL if one of the new poll types is requested; hopefully this will not break any existing code. (This is done so that programs have a dependable way of determining whether a filesystem supports the extended poll types or not.) The new poll types added are: POLLWRITE - file contents may have been modified POLLNLINK - file was linked, unlinked, or renamed POLLATTRIB - file's attributes may have been changed POLLEXTEND - file was extended Note that the internal operation of poll() means that it is impossible for two processes to reliably poll for the same event (this could be fixed but may not be worth it), so it is not possible to rewrite `tail -f' to use poll at this time.
# 31699	13-Dec-1997	bde	Restored ufs_pathconf() from rev.1.61. vop_stdpathconf() is too general to be of much use. Using it here broke the _PC_NAME_MAX, _PC_NO_TRUNC and _PC_PATH_MAX cases, and weakened the _PC_MAX_CANON, _PC_MAX_INPUT and _PC_VDISABLE cases.
# 31683	12-Dec-1997	peter	Fix(?) some style consistancy breakage and do some other nit-picking on the SUIDDIR changes.
# 31312	20-Nov-1997	bde	Fixed marking of access time for special files and fifos (don't do it if the file system is mounted noatime). Not fixed: the access time is marked at the start of a read() and not marked on successful completion. I think this should be handled at the vfs level. Print a better panic message for missing vops. Don't use printf() before panic(), since the printf()ed part isn't shown by gdb. This actually loses a little with the current gdb, since gdb just prints the fmt arg to panic, so %'s aren't expanded. gdb should fetch the full message from the message buffer if possible. Fixed default vop function for vop_getpages_desc. It needs to just return EOPNOTSUPP so that the vnode pager can get the pages in using a general method. Panicing broke exec'ing of files on ext2fs file systems. ffs works because it doesn't use the default. Fixed nearby style bugs.
# 31269	18-Nov-1997	phk	unifdef -UEXT2FS
# 31147	12-Nov-1997	julian	oops, fix left out semicolon in code I patched by hand.
# 31144	12-Nov-1997	julian	Reviewed by: hackers@freebsd.org in general Obtained from: Whistle Communications tree Add an option to the way UFS works dependent on the SUID bit of directories This changes makes things a whole lot simpler on systems running as fileservers for PCs and MACS. to enable the new code you must 1/ enable option SUIDDIR on the kernel. 2/ mount the filesystem with option suiddir. hopefully this makes it difficult enough for people to do this accidentally. see the new chmod(2) man page for detailed info.
# 30780	27-Oct-1997	bde	Removed unused #includes. The need for most of them went away with recent changes (docluster* and vfs improvements).
# 30743	26-Oct-1997	phk	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.
# 30513	17-Oct-1997	phk	Make a set of VOP standard lock, unlock & islocked VOP operators, which depend on the lock being located at vp->v_data. Saves 3x3 identical vop procs, more as the other filesystems becomes lock aware.
# 30492	16-Oct-1997	phk	Another VFS cleanup "kilo commit" 1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS} intereface function, and now lives in the ufsmount structure. 2. Remove VOP_SEEK, it was unused. 3. Add mode default vops: VOP_ADVLOCK vop_einval VOP_CLOSE vop_null VOP_FSYNC vop_null VOP_IOCTL vop_enotty VOP_MMAP vop_einval VOP_OPEN vop_null VOP_PATHCONF vop_einval VOP_READLINK vop_einval VOP_REALLOCBLKS vop_eopnotsupp And remove identical functionality from filesystems 4. Add vop_stdpathconf, which returns the canonical stuff. Use it in the filesystems. (XXX: It's probably wrong that specfs and fifofs sets this vop, shouldn't it come from the "host" filesystem, for instance ufs or cd9660 ?) 5. Try to make system wide VOP functions have vop_* names. 6. Initialize the um_* vectors in LFS. (Recompile your LKMS!!!)
# 30476	16-Oct-1997	phk	Staticize the ufs vnops member functions.
# 30474	16-Oct-1997	phk	VFS mega cleanup commit (x/N) 1. Add new file "sys/kern/vfs_default.c" where default actions for VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE, POLL, REVOKE and STRATEGY. Various stuff spread over the entire tree belongs here. 2. Change VOP_BLKATOFF to a normal function in cd9660. 3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These are private interface functions between UFS and the underlying storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now live in struct ufsmount instead. 4. Remove a kludge of VOP_ functions in all filesystems, that did nothing but obscure the simplicity and break the expandability. If a filesystem doesn't implement VOP_FOO, it shouldn't have an entry for it in its vnops table. The system will try to DTRT if it is not implemented. There are still some cruft left, but the bulk of it is done. 5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
# 30439	15-Oct-1997	phk	vnops megacommit 1. Use the default function to access all the specfs operations. 2. Use the default function to access all the fifofs operations. 3. Use the default function to access all the ufs operations. 4. Fix VCALL usage in vfs_cache.c 5. Use VOCALL to access specfs functions in devfs_vnops.c 6. Staticize most of the spec and fifofs vnops functions. 7. Make UFS panic if it lacks bits of the underlying storage handling.
# 29653	21-Sep-1997	dyson	Change the M_NAMEI allocations to use the zone allocator. This change plus the previous changes to use the zone allocator decrease the useage of malloc by half. The Zone allocator will be upgradeable to be able to use per CPU-pools, and has more intelligent usage of SPLs. Additionally, it has reasonable stats gathering capabilities, while making most calls inline.
# 29362	14-Sep-1997	peter	Convert select -> poll. Delete 'always succeed' select/poll handlers, replaced with generic call. Flag missing vnode op table entries.
# 29041	02-Sep-1997	bde	Removed unused #includes.
# 28774	26-Aug-1997	dyson	Back out some incorrect changes that was worse than the original bug.
# 28598	22-Aug-1997	dyson	Fix the "remove optimization" by removing it. Sorry for the trouble.
# 28466	20-Aug-1997	dyson	Performance improvment to minimize delayed write output of files that have been deleted. Submitted by: Peter M. Chen <pmchen@eecs.umich.edu>
# 27378	13-Jul-1997	bde	Always mark st_ctime for update upon successful completion of chown(). Previously, it wasn't marked for null chown()'s. We permit null chown()s as a special case of "appropriate privilege" - everyone has enough priviilege to not change ids (this is a better argument than the one I gave for rev.1.13, that null changes aren't really changes). However, POSIX.1 requires the update independently of whether anything has changed. Clear both the setuid and the setgid bits upon successful completion of non-null chown()s by non-root. Previously, the setuid bit was only changed for non-null changes of the uid, etc. POSIX.1 requires clearing both unless the call was made by a process with "appropriate privilege", in which case altering the bits is implementation-defined. We define appropriate privilege as `process is root, or the change is null', and the implementation-defined behaviour as not altering the bits. There is no interpretation that permits clearing only one of the bits. Reviewed by: jdp
# 26360	02-Jun-1997	julian	Submitted by: Whistle Communications (archie Cobbs) These changes add the ability to specify that a UFS file/directory cannot be unlinked. This is basically a scaled back version of the IMMUTABLE flag. The reason is to allow an administrator to create a directory hierarchy that a group of users can arbitrarily add/delete files from, but that the hierarchy itself is safe from removal by them. If the NOUNLINK definition is set to 0 then this results in no change to what happens normally. (and results in identical binary (in the kernel)). It can be proven that if this bit is never set by the admin, no new behaviour is introduced.. Several "good idea" comments from reviewers plus one grumble about creeping featurism. This code is in production in 2.2 based systems
# 25877	17-May-1997	phk	Remove redundant check for vp == dvp (done in VFS before calling).
# 24438	31-Mar-1997	peter	Treat symlinks as first class citizens with their own uid/gid rather than as shadows of their containing directory. This should solve the problem of users not being able to delete their symlinks from /tmp once and for all. Symlinks do not have modes though, they are accessable to everything that can read the directory (as before). They are made to show this fact at lstat time (they appear as mode 0777 always, since that's how the the lookup routines in the kernel treat them). More commits will follow, eg: add a real lchown() syscall and man pages.
# 24101	22-Mar-1997	bde	Fixed some invalid (non-atomic) accesses to `time', mostly ones of the form `tv = time'. Use a new function gettime(). The current version just forces atomicicity without fixing precision or efficiency bugs. Simplified some related valid accesses by using the central function.
# 23562	09-Mar-1997	mpp	Update a number of routines to reflect the actual name of the routine that caused the panic.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 22619	12-Feb-1997	bde	Removed FIFO ifdef again (see rev.1.5).
# 22521	10-Feb-1997	dyson	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 19388	04-Nov-1996	bde	Fixed some races and misleading comments in ufs_rename(). 1. When a directory is renamed to an existing (empty) directory, it is possible for the target vnode to become the source vnode underneath you (because another process may complete the same rename). It was assumed that this can't happen, and the bogus errno EINVAL was returned. This was fairly harmless. Fix: return ENOENT instead, as if the source directory was renamed a little earlier. 2. The same metamorphosis is possible for non-directories. It was assumed that this can't happen, and the code for handling "just removing a link name" happened to be used. This would have worked except for fatal bugs in the link name removal - the link name was assumed to still be there, and a null pointer was followed. Fix: check the result of relookup(). This fixes PR 1930. Notes: (a) POSIX seems to say that removing link names shall have no effect. BSD (4.4Lite2 at least) does something reasonable instead. (b) The relookup() may find a file unrelated to the original. Removing this isn't correct. Consider 3 existing files A, B and C, and concurrent renames: AB = rename(A, B), another AB, and CA = rename("c", "a"). If rename() is atomic, then only the following results are possible: AB, AB (fails), CA: A = original C, B = original A, C = gone AB, CA, AB: A = gone, B = original C, C = gone CA, AB, AB (fails): A = gone, B = original C, C = gone but ufs_rename() can give: A,AB,CA,B (sorta): A = gone, B = original A, C = gone This usually doesn't matter, since getting into a race is usually an error. --- These fixes should be in 2.1.6 and 2.2.
# 18397	19-Sep-1996	nate	In sys/time.h, struct timespec is defined as: /* * Structure defined by POSIX.4 to be like a timeval. / struct timespec { time_t ts_sec; / seconds / long ts_nsec; / and nanoseconds */ }; The correct names of the fields are tv_sec and tv_nsec. Reminded by: James Drobina <jdrobina@infinet.com>
# 18020	03-Sep-1996	bde	Eliminated nested include of <sys/unistd.h> in <sys/file.h> in the kernel. Include it directly in the few places where it is used. Reduced some #includes of <sys/file.h> to #includes of <sys/fcntl.h> or nothing.
# 17040	09-Jul-1996	wollman	Quiet a couple of -Wunused warnings.
# 14909	29-Mar-1996	bde	Fixed reference counting related to relookup(). relookup() must be called with the directory referenced, and this reference will be dropped iff relookup() fails, so the value returned must not be ignored. Reviewed by: davidg
# 13490	19-Jan-1996	dyson	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 13260	05-Jan-1996	wollman	Convert QUOTA to new-style option.
# 12767	11-Dec-1995	dyson	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# 12399	19-Nov-1995	dyson	Change incorrect '#if EXT2FS' to '#ifdef EXT2FS'
# 12158	09-Nov-1995	bde	Introduced a type `vop_t' for vnode operation functions and used it 1138 times (:-() in casts and a few more times in declarations. This change is null for the i386. The type has to be `typedef int vop_t(void *)' and not `typedef int vop_t()' because `gcc -Wstrict-prototypes' warns about the latter. Since vnode op functions are called with args of different (struct pointer) types, neither of these function types is any use for type checking of the arg, so it would be preferable not to use the complete function type, especially since using the complete type requires adding 1138 casts to avoid compiler warnings and another 40+ casts to reverse the function pointer conversions before calling the functions.
# 12117	05-Nov-1995	dyson	Changes to existing files for ext2fs support. The UFS mods need rework in the future as they are a bit crufty -- but at least the stuff is in the tree now.
# 11644	22-Oct-1995	dg	Moved the filesystem read-only check out of the syscalls and into the filesystem layer, as was done in lite-2. Merged in some other cosmetic changes while I was at it. Rewrote most of msdosfs_access() to be more like ufs_access() and to include the FS read-only check. Obtained from: partially from 4.4BSD-lite2
# 11297	07-Oct-1995	bde	Return EINVAL instead of panicing for rename("dir1", "dir2/.."). Fixes part of PR 760. This bug seems to be very old.
# 10646	08-Sep-1995	julian	Obtained from:4.4lite2 fix a change where a shortcut resulted in teh wrong answer.. e.g. touch a touch b mv a b resulted in b being removed and a being moved to b in the shortcut.. touch a ln a b mv a b the wrong link was removed.. leaving a instead of b, giving a different result to when both files were separate.
# 10551	03-Sep-1995	dyson	Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count for VOP_BMAP. Updated affected filesystems...
# 10358	28-Aug-1995	julian	Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert) This is a composite of 3 patch sets submitted by terry. they are: New low-level init code that supports loadbal modules better some cleanups in the namei code to help terry in 16-bit character support some changes to the mount-root code to make it a little more modular.. NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.. certainly mounting root of disk still works just fine.. mfs should work but is untested. (tomorrows task) The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
# 9842	01-Aug-1995	dg	Removed my special-case hack for VOP_LINK and fixed the problem with the wrong vp's ops vector being used by changing the VOP_LINK's argument order. The special-case hack doesn't go far enough and breaks the generic bypass routine used in some non-leaf filesystems. Pointed out by Kirk McKusick.
# 9354	28-Jun-1995	dg	Fixed VOP_LINK argument order botch.
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 8529	15-May-1995	dg	From Bruce Evans: I ran into another manifestation of the problem reported in PR 211 and fixed it. Try this: as non-root: cd /tmp; mkdir x y x/z as root: chown root /tmp/x/z as non-root: cd /tmp/x; mv z ../y # EACCES as expected as root: cd /tmp/x; mv z ../y # EINVAL NOT as expected This is because ufs_rename() sets IN_RENAME and fails to clear it. Reviewed by: davidg Submitted by: bde
# 8053	25-Apr-1995	dyson	Fixed the mmap hang fix previously committed so that it works with options DIAGNOSTIC, and clear up an additional reference count problem.
# 8041	24-Apr-1995	dyson	Changes to get rid of ufslk2 hangs when doing read/write to/from mmap regions that are in the same file as the read/write.
# 7695	09-Apr-1995	dg	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# 7169	19-Mar-1995	dg	Backed out change to panic call: As Chris just pointed out to me, panic() does indeed work like printf(). gdb gets the string untranslated for some reason.
# 7156	19-Mar-1995	dg	Fix a call to panic: panic doesn't do token substitution on the panic string.
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 7006	11-Mar-1995	dg	Removed gratuitous and extremely evil setting of OBJ_INTERNAL. This caused a cascade of problems including kernel memory corruption, file corruption, system hangs, and panics.
# 6357	14-Feb-1995	phk	YF fix.
# 5455	09-Jan-1995	dg	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 4827	26-Nov-1994	bde	Submitted by: Kirk McKusick Allow chown() to return success if the gid isn't changed even if the gid is not the caller's. Such gids are normal for files created in world-writable directories sucj as /tmp. This "fixes" annoying error messages for mv'ing files created in /tmp to another file system. mv still preserves the foreign gid of /tmp, but now does it silently.
# 3745	20-Oct-1994	wollman	Make my ALLDEVS kernel compile (basically, LINT minus a lot of options). This involves fixing a few things I broke last time.
# 3605	15-Oct-1994	ache	Add back variable declaration removed by wrong previous cleanups
# 3427	08-Oct-1994	phk	POSSIBLE BOGUS CODE found, (related to dos-partitions) in ufs_disksubr.c, look for CC_WALL. Cosmetics, a couple of unused vars.
# 3420	07-Oct-1994	phk	Cosmetics.
# 3396	06-Oct-1994	dg	Use tsleep() rather than sleep so that 'ps' is more informative about the wait.
# 3167	28-Sep-1994	dfr	Make NFS ask the filesystems for directory cookies instead of making them itself.
# 3148	27-Sep-1994	phk	Moved the "relookup" routine into vfs_lookup.c from ufs/ufs/ufs_vnops.c. Several FS's use this, so it doesn't belong in ufs. (unionfs, msdosfs and ufs)
# 2979	22-Sep-1994	wollman	More loadable VFS changes: - Make a number of filesystems work again when they are statically compiled (blush) - FIFOs are no longer optional; ``options FIFO'' removed from distributed config files.
# 1960	08-Aug-1994	dg	Made lockf advisory locking code generic (rather than ufs specific), and use it in NFS. This is required both for diskless support and for POSIX compliance. Note: the support in NFS is only for the local node. Submitted by: based on work originally done by Yuval Yurom
# 1817	02-Aug-1994	dg	Added $Id$
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources