History log of /freebsd-current/sys/kern/uipc_shm.c
Revision	Date	Author	Comments
# 7975f57b	20-May-2024	Ricardo Branco <rbranco@suse.de>	uipc_shm: Fix double check for shmfd->shm_path Reviewed by: emaste, zlei Pull Request: https://github.com/freebsd/freebsd-src/pull/1250
# e411b227	18-Apr-2024	Mark Johnston <markj@FreeBSD.org>	uipc_shm: Fix a free() of an uninitialized variable Reported by: Coverity CID: 1544043 Fixes: b112232e4fb9 ("uipc_shm: Copyin userpath for ktrace(2)")
# b112232e	09-Apr-2024	Jake Freeland <jfree@FreeBSD.org>	uipc_shm: Copyin userpath for ktrace(2) If userpath is not SHM_ANON, then copy it in early so ktrace(2) can record it. Without this change, ktrace(2) will attempt to strcpy a userspace string and trigger a page fault. Reported by: syzbot+490b9c2a89f53b1b9779@syzkaller.appspotmail.com Fixes: 0cd9cde767c3 Approved by: markj (mentor) Reviewed by: markj MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D44702
# 0cd9cde7	06-Apr-2024	Jake Freeland <jfree@FreeBSD.org>	ktrace: Record namei violations with KTR_CAPFAIL Report namei path lookups while Capsicum violation tracing with CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic capability mode behavior. Reviewed by: markj Approved by: markj (mentor) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D40680
# f28526e9	19-Jan-2024	Konstantin Belousov <kib@FreeBSD.org>	kcmp(2): implement for generic file types Reviewed by: brooks, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43518
# 2619c5cc	20-Nov-2023	Jason A. Harmening <jah@FreeBSD.org>	Avoid waiting on physical allocations that can't possibly be satisfied - Change vm_page_reclaim_contig[_domain] to return an errno instead of a boolean. 0 indicates a successful reclaim, ENOMEM indicates lack of available memory to reclaim, with any other error (currently only ERANGE) indicating that reclamation is impossible for the specified address range. Change all callers to only follow up with vm_page_wait* in the ENOMEM case. - Introduce vm_domainset_iter_ignore(), which marks the specified domain as unavailable for further use by the iterator. Use this function to ignore domains that can't possibly satisfy a physical allocation request. Since WAITOK allocations run the iterators repeatedly, this avoids the possibility of infinitely spinning in domain iteration if no available domain can satisfy the allocation request. PR: 274252 Reported by: kevans Tested by: kevans Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D42706
# 6df6facf	18-Aug-2023	Konstantin Belousov <kib@FreeBSD.org>	shmfd: hide direct rangelock(9) use under a wrapper Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# f3e11927	14-Aug-2023	Dmitry Chagin <dchagin@FreeBSD.org>	vm: Allow MAP_32BIT for all architectures Reviewed by: alc, kib, markj Differential revision: https://reviews.freebsd.org/D41435
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# 0919f29d	23-Nov-2022	Konstantin Belousov <kib@FreeBSD.org>	shmfd: account for the actually allocated pages Return the value as stat(2) st_blocks. Suggested and reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37097
# 37aea264	20-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	tmpfs: for used pages, account really allocated pages, instead of file sizes This makes tmpfs size accounting correct for sparse files. Also correctly report st_blocks/va_bytes. Previously the reported value did not account for the swapped out pages. PR: 223015 Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37097
# 7ec4b29b	23-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	uiomove_object: hide diagnostic under bootverbose Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37097
# 8c9aa94b	23-Jul-2022	Ka Ho Ng <khng@FreeBSD.org>	Convert runtime param checks to KASSERTs for fo_fspacectl Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D35880
# 7060da62	29-Jun-2022	Jamie Gritton <jamie@FreeBSD.org>	jail: Remove a prison's shared memory when it dies Add shm_remove_prison(), that removes all POSIX shared memory segments belonging to a prison. Call it from prison_cleanup() so a prison won't be stuck in a dying state due to the resources still held. PR: 257555 Reported by: grembo
# 9891cb1e	24-Feb-2022	Warner Losh <imp@FreeBSD.org>	Eliminate curlen, it's set but never used Sponsored by: Netflix
# d7c4ea7d	24-Feb-2022	Jamie Gritton <jamie@FreeBSD.org>	posixshm: Allow jails to use kern.ipc.posix_shm_list PR: 257554 Reported by: grembo@
# dc752617	17-Jan-2022	Mark Johnston <markj@FreeBSD.org>	posixshm: Report output buffer truncation from kern.ipc.posix_shm_list PR: 240573 Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33912
# 3b5331dd	21-Oct-2021	Konstantin Belousov <kib@FreeBSD.org>	uipc_shm: silence warnings about write-only variables in largepage code In shm_largepage_phys_populate(), the result from vm_page_grab() is only needed for assertion. In shm_dotruncate_largepage(), there is a commented-out prototype code for managed largepages. The oldobjsz is saved for its sake, so mark the variable as __unused directly. Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 2b68eb8e	01-Oct-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove thread argument from VOP_STAT and fo_stat.
# 747a4726	29-Sep-2021	Jamie Gritton <jamie@FreeBSD.org>	Fix error return of kern.ipc.posix_shm_list, which caused it (and thus "posixshmcontrol ls") to fail for all jails that didn't happen to own the last shm object in the list.
# 9e202d03	25-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	fspacectl(2): Changes on rmsr.r_offset's minimum value returned rmsr.r_offset now is set to rqsr.r_offset plus the number of bytes zeroed before hitting the end-of-file. After this change rmsr.r_offset no longer contains the EOF when the requested operation range is completely beyond the end-of-file. Instead in such case rmsr.r_offset is equal to rqsr.r_offset. Callers can obtain the number of bytes zeroed by subtracting rqsr.r_offset from rmsr.r_offset. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31677
# 5c1428d2	24-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	uipc_shm: Handle offset on shm_size as if it is beyond shm_size This avoids any unnecessary works in such case. Sponsored by: The FreeBSD Foundation Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D31655
# 1eaa3652	24-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	fspacectl(2): Clarifies the return values rmacklem@ spotted two things in the system call: - Upon returning from a successful operation, vop_stddeallocate can update rmsr.r_offset to a value greater than file size. This behavior, although being harmless, can be confusing. - The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is undocumented. This commit has the following changes: - vop_stddeallocate and shm_deallocate to bound the affected area further by the file size. - The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is documented. - The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return len is explicitly documented to be the value 0, and the return offset is restricted to be the smallest of off + len and current file size suggested by kib@. This semantic allows callers to interact better with potential file size growth after the call. Sponsored by: The FreeBSD Foundation Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D31604
# 454bc887	12-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	uipc_shm: Implements fspacectl(2) support This implements fspacectl(2) support on shared memory objects. The semantic of SPACECTL_DEALLOC is equivalent to clearing the backing store and free the pages within the affected range. If the call succeeds, subsequent reads on the affected range return all zero. tests/sys/posixshm/posixshm_tests.c is expanded to include a fspacectl(2) functional test. Sponsored by: The FreeBSD Foundation Reviewed by: kevans, kib Differential Revision: https://reviews.freebsd.org/D31490
# d474440a	03-May-2021	Konstantin Belousov <kib@FreeBSD.org>	Constify vm_pager-related virtual tables. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30070
# 85078b85	17-Nov-2020	Conrad Meyer <cem@FreeBSD.org>	Split out cwd/root/jail, cmask state from filedesc table No functional change intended. Tracking these structures separately for each proc enables future work to correctly emulate clone(2) in linux(4). __FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof. Reviewed by: kib Discussed with: markj, mjg Differential Revision: https://reviews.freebsd.org/D27037
# 78257765	23-Sep-2020	Mark Johnston <markj@FreeBSD.org>	Add a vmparam.h constant indicating pmap support for large pages. Enable SHM_LARGEPAGE support on arm64. Reviewed by: alc, kib Sponsored by: Juniper Networks, Inc., Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26467
# f9cc8410	18-Sep-2020	Eric van Gyzen <vangyzen@FreeBSD.org>	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214
# 79783634	10-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Fix interaction between largepages and seals/writes. On write with SHM_GROW_ON_WRITE, use proper truncate. Do not allow to grow largepage shm if F_SEAL_GROW is set. Note that shrinks are not supported at all due to unmanaged mappings. Call to vm_pager_update_writecount() is only valid for swap objects, skip it for unmanaged largepages. Largepages cannot support write sealing. Do not writecnt largepage mappings. Reported by: kevans Reviewed by: kevans, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26394
# d301b358	09-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Support for userspace non-transparent superpages (largepages). Created with shm_open2(SHM_LARGEPAGE) and then configured with FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee that all userspace mappings of it are served by superpage non-managed mappings. Only amd64 for now, both 2M and 1G superpages can be requested, the latter requires CPU feature. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652
# 25f44824	09-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	uipc_shm.c: Move comment where it belongs. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652
# 6fed89b1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	kern: clean up empty lines in .c and .h files
# 5dd47b52	31-Aug-2020	Kyle Evans <kevans@FreeBSD.org>	posixshm: fix setting of shm_flags Noted in D24652, we currently set shmfd->shm_flags on every shm_open()/shm_open2(). This wasn't properly thought out; one shouldn't be able to specify incompatible flags on subsequent opens of non-anon shm. Move setting of shm_flags explicitly to the two places shmfd are created, as we do with seals, and validate when we're opening a pre-existing mapping that we've either passed no flags or we've passed the exact same flags as the first time. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D26242
# d292b194	05-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the obsolete privused argument from vaccess This brings argument count down to 6, which is passable without the stack on amd64.
# 3f07b9d9	09-Jul-2020	Kyle Evans <kevans@FreeBSD.org>	shm_open2: Implement SHM_GROW_ON_WRITE Lack of SHM_GROW_ON_WRITE is actively breaking Python's memfd_create tests, so go ahead and implement it. A future change will make memfd_create always set SHM_GROW_ON_WRITE, to match Linux behavior and unbreak Python's tests on -CURRENT. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D25502
# 84242cf6	25-Jun-2020	Mark Johnston <markj@FreeBSD.org>	Call swap_pager_freespace() from vm_object_page_remove(). All vm_object_page_remove() callers, except linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when removing a range of pages from an object. The LinuxKPI case appears to be an unintentional omission that could result in leaked swap blocks, so unconditionally free swap space in vm_object_page_remove() to protect against similar bugs in the future. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25329
# 51a16c84	14-Apr-2020	Kyle Evans <kevans@FreeBSD.org>	posixshm: fix counting of writable mappings Similar to mmap'ing vnodes, posixshm should count any mapping where maxprot contains VM_PROT_WRITE (i.e. fd opened r/w with no write-seal applied) as writable and thus blocking of any write-seal. The memfd tests have been amended to reflect the fixes here, which notably includes: 1. Fix for error return bug; EPERM is not a documented failure mode for mmap 2. Fix rejection of write-seal with active mappings that can be upgraded via mprotect(2). Reported by: markj Discussed with: markj, kib
# c7841c6b	13-Apr-2020	Mark Johnston <markj@FreeBSD.org>	Relax restrictions on private mappings of POSIX shm objects. When creating a private mapping of a POSIX shared memory object, VM_PROT_WRITE should always be included in maxprot regardless of permissions on the underlying FD. Otherwise it is possible to open a shm object read-only, map it with MAP_PRIVATE and PROT_WRITE, and violate the invariant in vm_map_insert() that (prot & maxprot) == prot. Reported by: syzkaller Reviewed by: kevans, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24398
# 4cf919ed	02-Mar-2020	Mark Johnston <markj@FreeBSD.org>	Fix the malloc type used in sys_shm_unlink() after r354808. PR: 244563 Reported by: swills
# f72eaaeb	28-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Use unlocked grab for uipc_shm/tmpfs. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23865
# d6e13f3b	19-Jan-2020	Jeff Roberson <jeff@FreeBSD.org>	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033
# 39eae263	08-Jan-2020	Kyle Evans <kevans@FreeBSD.org>	shmfd: posix_fallocate(2): only take rangelock for section we need Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX for safety, so we still get proper synchronization of shmfd->shm_size in effect. There's no need to block readers/writers of earlier segments when we're just reserving more space, so narrow the scope -- it would likely be safe to narrow it completely to just the section of the range that extends beyond our current size, but this likely isn't worth it since the size isn't stable until the writelock is granted the first time. Suggested by: cem (passing comment)
# f1040532	08-Jan-2020	Kyle Evans <kevans@FreeBSD.org>	posixshm: implement posix_fallocate(2) Linux expects to be able to use posix_fallocate(2) on a memfd. Other places would use this with shm_open(2) to act as a smarter ftruncate(2). Test has been added to go along with this. Reviewed by: kib (earlier version) Differential Revision: https://reviews.freebsd.org/D23042
# 535b1df9	04-Jan-2020	Kyle Evans <kevans@FreeBSD.org>	shm: correct KPI mistake introduced around memfd_create When file sealing and shm_open2 were introduced, we should have grown a new kern_shm_open2 helper that did the brunt of the work with the new interface while kern_shm_open remains the same. Instead, more complexity was introduced to kern_shm_open to handle the additional features and consumers had to keep changing in somewhat awkward ways, and a kern_shm_open2 was added to wrap kern_shm_open. Backpedal on this and correct the situation- kern_shm_open returns to the interface it had prior to file sealing being introduced, and neither function needs an initial_seals argument anymore as it's handled in kern_shm_open2 based on the shmflags.
# 58366f05	04-Jan-2020	Kyle Evans <kevans@FreeBSD.org>	shmfd/mmap: restrict maxprot with MAP_SHARED + F_SEAL_WRITE If a write seal is set on a shared mapping, we must exclude VM_PROT_WRITE as the fd is effectively read-only. This was discovered by running devel/linux-ltp, which mmap's with acceptable protections specified then attempts to raise to PROT_READ|PROT_WRITE with mprotect(2), which we allowed. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22978
# 9f5632e6	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885
# d29f674f	14-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Fix a mistake in r355765. We need to activate the page if it is not yet on a pagequeue. Reported by: pho
# a8081778	14-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654
# 63967687	19-Nov-2019	Jeff Roberson <jeff@FreeBSD.org>	Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates redundant complicated checks and additional locking required only for anonymous memory. Introduce vm_object_allocate_anon() to create these objects. DEFAULT and SWAP objects now have the correct settings for non-anonymous consumers and so individual consumers need not modify the default flags to create super-pages and avoid ONEMAPPING/NOSPLIT. Reviewed by: alc, dougm, kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22119
# 2d5603fe	18-Nov-2019	David Bright <dab@FreeBSD.org>	Jail and capability mode for shm_rename; add audit support for shm_rename Co-mingling two things here: * Addressing some feedback from Konstantin and Kyle re: jail, capability mode, and a few other things * Adding audit support as promised. The audit support change includes a partial refresh of OpenBSM from upstream, where the change to add shm_rename has already been accepted. Matthew doesn't plan to work on refreshing anything else to support audit for those new event types. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: kib Relnotes: Yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22083
# 0012f373	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594
# 63e97555	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(1/6) Replace busy checks with acquires where it is trivial to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548
# 5a391b57	01-Oct-2019	Kyle Evans <kevans@FreeBSD.org>	shm_open2(2): completely unbreak kern_shm_open2(), since conception, completely fails to pass the mode along to kern_shm_open(). This breaks most uses of it. Add tests alongside this that actually check the mode of the returned files. PR: 240934 [pulseaudio breakage] Reported by: ler, Andrew Gierth [postgres breakage] Diagnosed by: Andrew Gierth (great catch) Tested by: ler, tmunro Pointy hat to: kevans
# 9afb12ba	26-Sep-2019	David Bright <dab@FreeBSD.org>	Add an shm_rename syscall Add an atomic shm rename operation, similar in spirit to a file rename. Atomically unlink an shm from a source path and link it to a destination path. If an existing shm is linked at the destination path, unlink it as part of the same atomic operation. The caller needs the same permissions as shm_unlink to the shm being renamed, and the same permissions for the shm at the destination which is being unlinked, if it exists. If those fail, EACCES is returned, as with the other shm_* syscalls. truss support is included; audit support will come later. This commit includes only the implementation; the sysent-generated bits will come in a follow-on commit. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: jilles (earlier revision) Reviewed by: brueffer (manpages, earlier revision) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21423
# a9ac5e14	25-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	sysent: regenerate after r352705 This also implements it, fixes kdump, and removes no longer needed bits from lib/libc/sys/shm_open.c for the interim.
# 20f70576	25-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	Add a shm_open2 syscall to support upcoming memfd_create shm_open2 allows a little more flexibility than the original shm_open. shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate shmflag argument that can be expanded later. Currently the only shmflag is to allow file sealing on the returned fd. shm_open and memfd_create will both be implemented in libc to use this new syscall. __FreeBSD_version is bumped to indicate the presence. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21393
# 0cd95859	25-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	[2/3] Add an initial seal argument to kern_shm_open() Now that flags may be set on posixshm, add an argument to kern_shm_open() for the initial seals. To maintain past behavior where callers of shm_open(2) are guaranteed to not have any seals applied to the fd they're given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special flag could be opened later for shm_open(2) to indicate that sealing should be allowed. We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a shmfd that already existed. A note's been added about the assumptions we've made here as a hint towards anyone wanting to allow other seals to be applied at creation. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21392
# af755d3e	25-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391
# c7575748	10-Sep-2019	Jeff Roberson <jeff@FreeBSD.org>	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546
# fee2a2fa	09-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Change synchronization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficient as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler.
vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accommodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486
# dca52ab4	03-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	posixshm: start counting writeable mappings r351650 switched posixshm to using OBJT_SWAP for shm_object r351795 added support to the swap_pager for tracking writeable mappings Take advantage of this and start tracking writeable mappings; fd sealing will use this to reject a seal on writing with EBUSY if any such mapping exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456
# 32287ea7	31-Aug-2019	Kyle Evans <kevans@FreeBSD.org>	posixshm: switch to OBJT_SWAP in advance of other changes Future changes to posixshm will start tracking writeable mappings in order to support file sealing. Tracking writeable mappings for an OBJT_DEFAULT object is complicated as it may be swapped out and converted to an OBJT_SWAP. One may generically add this tracking for vm_object, but this is difficult to do without increasing memory footprint of vm_object and blowing up memory usage by a significant amount. On the other hand, the swap pager can be expanded to track writeable mappings without increasing vm_object size. This change is currently in D21456. Switch over to OBJT_SWAP in advance of the other changes to the swap pager and posixshm.
# b5d239cb	28-Aug-2019	Mark Johnston <markj@FreeBSD.org>	Wire pages in vm_page_grab() when appropriate. uiomove_object_page() and exec_map_first_page() would previously wire a page after having grabbed it. Ask vm_page_grab() to perform the wiring instead: this removes some redundant code, and is cheaper in the case where the requested page is not resident since the page allocator can be asked to initialize the page as wired, whereas a separate vm_page_wire() call requires the page lock. In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued before returning, but this is unnecessary since vm_page_free() will trigger a batched dequeue of the page. Reviewed by: alc, kib Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21440
# b5a7ac99	31-Jul-2019	Kyle Evans <kevans@FreeBSD.org>	kern_shm_open: push O_CLOEXEC into caller control The motivation for this change is to allow wrappers around shm to be written that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1(). Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from the context. sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX requirements, and a comment has been dropped in to kern_fd_open() to explain the situation and add a pointer to where O_CLOEXEC setting is maintained for shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally sets O_CLOEXEC to match previous behavior. This also has the side-effect of making flags correctly reflect the O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a glance-over leads me to believe that it didn't really matter. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21119
# 91898857	29-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Avoid relying on header pollution from sys/refcount.h. MFC after: 3 days Sponsored by: The FreeBSD Foundation
# eeacb3b0	08-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247
# 5c066cd2	30-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Remove TODO comment after posixshmcontrol(1) added. Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 56d0e33e	22-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Add a kern.ipc.posix_shm_list sysctl. The sysctl provides the listing on named linked posix shared memory segments existing in the system. Reuse shm_fill_kinfo() for filling individual struct kinfo_file. Remove unneeded lock around reading of shmfd->shm_mode. Reviewed by: jilles, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20258
# e4b77548	22-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Report ref count of the backing object as st_nlink for posix shm fd. Unless there are transient references to the object, the ref count is equal to the number of the shared memory segment mappings plus one. Reviewed by: jilles, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20258
# 2b64ab22	28-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Allow FIONBIO and FIOASYNC ioctls on POSIX shm descriptors. They have no effect, as with filesystem file descriptors. This improves compatibility with some existing userspace code. Submitted by: Greg V <greg@unrelenting.technology> Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19330
# cc426dd3	11-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation
# 7883ce1f	21-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	uipc_shm: use unr64 for inode numbers Sponsored by: The FreeBSD Foundation
# 8a36da99	27-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
# 8d6fbbb8	07-Nov-2017	Jeff Roberson <jeff@FreeBSD.org>	Replace many instances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon
# 0c0e1e96	02-Oct-2017	Alan Cox <alc@FreeBSD.org>	Use vm_page_active() rather than directly accessing the page's queue field. Reviewed by: kib, markj MFC after: 2 weeks X-MFC with: r324146
# 0ffc7ed7	30-Sep-2017	Mark Johnston <markj@FreeBSD.org>	Have uiomove_object_page() keep accessed pages in the active queue. Previously, uiomove_object_page() would maintain LRU by requeuing the accessed page. This involves acquiring one of the heavily contended page queue locks. Moreover, it is unnecessarily expensive for pages in the active queue. As of r254304 the page daemon continually performs a slow scan of the active queue, with the effect that unreferenced pages are gradually moved to the inactive queue, from which they can be reclaimed. Prior to that revision, the active queue was scanned only during shortages of free and inactive pages, meaning that unreferenced pages could get "stuck" in the queue. Thus, tmpfs was required to use the inactive queue and requeue pages in order to maintain LRU. Now that this is no longer the case, tmpfs I/O operations can use the active queue and avoid the page queue locks in most cases, instead setting PGA_REFERENCED on referenced pages to provide pseudo-LRU. Reviewed by: alc (previous version) MFC after: 2 weeks
# 34d3e89f	27-Jun-2017	Konstantin Belousov <kib@FreeBSD.org>	Do not ignore an error from vm_mmap_object(). Found and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 15bcf785	31-Mar-2017	Robert Watson <rwatson@FreeBSD.org>	Audit arguments to POSIX message queues, semaphores, and shared memory. This requires minor changes to the audit framework to allow capturing paths that are not filesystem paths (i.e., will not be canonicalised relative to the process current working directory and/or filesystem root). Obtained from: TrustedBSD Project MFC after: 3 weeks Sponsored by: DARPA, AFRL
# 2a016de1	19-Mar-2017	Alan Cox <alc@FreeBSD.org>	Use IDX_TO_OFF(), not ptoa(), when converting the difference between two vm_pindex_t's into a vm_ooffset_t. The length given to shm_dotruncate() must never be negative. Assert this. Tidy up a comment. Reviewed by: kib MFC after: 1 week
# 987ff181	12-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
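The range rule described in the entry above (reject a negative offset, and reject any range whose end would overflow) can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the kernel's code: `range_is_valid` is a hypothetical helper, and the offset is modeled as a signed 64-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the FreeBSD sources: accept the
 * range [offset, offset + size) only if the offset is non-negative and
 * the end of the range still fits in a signed 64-bit offset.
 */
static bool
range_is_valid(int64_t offset, uint64_t size)
{
	if (offset < 0)
		return (false);
	/* offset >= 0 here, so INT64_MAX - offset cannot underflow. */
	if (size > (uint64_t)(INT64_MAX - offset))
		return (false);
	/* After both checks the sum can no longer wrap around. */
	assert(offset + (int64_t)size >= offset);
	return (true);
}
```

The overflow test is written as a comparison against `INT64_MAX - offset` rather than by computing `offset + size` first, since signed overflow is undefined behaviour in C.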
# 2d612d2d	11-Dec-2016	Alan Cox <alc@FreeBSD.org>	When tmpfs and POSIX shm pagein a page for the sole purpose of performing truncation, immediately queue the page for asynchronous laundering rather than making the page pass through inactive queue first. Reviewed by: kib, markj
# 7667839a	15-Nov-2016	Alan Cox <alc@FreeBSD.org>	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497
# ce3ee09b	14-Aug-2016	Alan Cox <alc@FreeBSD.org>	Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when neither vm_pager_has_page() nor vm_pager_get_pages() is called. Reviewed by: kib, markj MFC after: 3 weeks
# 6ea906ee	23-Jun-2016	Jilles Tjoelker <jilles@FreeBSD.org>	posixshm: Fix lock leak when mac_posixshm_check_read rejects read. While reading the code, I noticed that shm_read() returns without unlocking foffset and rangelock if mac_posixshm_check_read() rejects the read. Reviewed by: kib, jhb, rwatson Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6927
# 55e0987a	26-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: extend use of the howmany() macro when available. We have a howmany() macro in the <sys/param.h> header that is convenient to re-use as it makes things easier to read.
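For readers unfamiliar with it, `howmany(x, y)` is a round-up division. A minimal sketch of a typical use follows; `pages_for_length` and `SKETCH_PAGE_SIZE` are hypothetical names chosen for illustration, and the macro is redefined locally (guarded by `#ifndef`) only so the snippet builds without `<sys/param.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the macro in <sys/param.h>; defined here only if absent. */
#ifndef howmany
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))
#endif

#define	SKETCH_PAGE_SIZE	4096	/* assumed page size, for illustration */

/* Hypothetical helper: number of whole pages needed to hold 'len' bytes. */
static size_t
pages_for_length(size_t len)
{
	/* The rounding term must not overflow size_t. */
	assert(len <= SIZE_MAX - (SKETCH_PAGE_SIZE - 1));
	return (howmany(len, (size_t)SKETCH_PAGE_SIZE));
}
```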
# 44c16975	14-Apr-2016	Jamie Gritton <jamie@FreeBSD.org>	Clean up some style(9) violations.
# cc7b259a	13-Apr-2016	Jamie Gritton <jamie@FreeBSD.org>	Separate POSIX sem/shm objects in jails, by prepending the jail's path name to the object's "path". While the objects don't have real path names, it's a filesystem-like namespace, which allows jails to be kept to their own space, but still allows the system / jail parent to access a jail's IPC. PR: 208082
# 1bdbd705	28-Feb-2016	Konstantin Belousov <kib@FreeBSD.org>	Implement process-shared locks support for libthr.so.3, without breaking the ABI. Special value is stored in the lock pointer to indicate shared lock, and offline page in the shared memory is allocated to store the actual lock. Reviewed by: vangyzen (previous version) Discussed with: deischen, emaste, jhb, rwatson, Martin Simmons <martin@lispworks.com> Tested by: pho Sponsored by: The FreeBSD Foundation
# b0cd2017	16-Dec-2015	Gleb Smirnoff <glebius@FreeBSD.org>	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strengthening the promises on the busyness of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 7ee1b208	01-Aug-2015	Ed Schouten <ed@FreeBSD.org>	Add kern_shm_open(). This allows you to specify the capabilities that the new file descriptor should have. This allows us to create shared memory objects that only have the rights we're interested in. The idea behind restricting the rights is that it makes it a lot easier for CloudABI to get consistent behaviour across different operating systems. We only need to make sure that a shared memory implementation consistently implements the operations that are whitelisted. Approved by: kib Obtained from: https://github.com/NuxiNL/freebsd
# 093c7f39	12-Jun-2015	Gleb Smirnoff <glebius@FreeBSD.org>	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 7077c426	04-Jun-2015	John Baldwin <jhb@FreeBSD.org>	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct.
Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio
# b9062c93	24-Apr-2015	Konstantin Belousov <kib@FreeBSD.org>	Use correct length for sparse uiomove(). It must be clipped to the page size; len is the total transfer length, which may be larger than zero_region. Reported and tested by: clusteradm (gjb) Sponsored by: The FreeBSD Foundation X-MFC-With: r281442
# 6311d7aa	11-Apr-2015	Will Andrews <will@FreeBSD.org>	uiomove_object_page(): Avoid instantiating pages in sparse regions on reads. Check whether the page being requested is either resident or on swap. If not, read from the zero_region instead of instantiating an unnecessary page. This avoids consuming memory for sparse files on tmpfs, when they are read by applications that do not use SEEK_HOLE/SEEK_DATA (which is most of them). Reviewed by: kib MFC after: 1 week Sponsored by: Spectra Logic
# 90f54cbf	11-Apr-2015	Mateusz Guzik <mjg@FreeBSD.org>	fd: remove filedesc argument from fdclose Just accept a thread instead. This makes it consistent with fdalloc. No functional changes.
# f4c6aea3	08-Feb-2015	Alan Cox <alc@FreeBSD.org>	Preset the object's color, or alignment, to maximize superpage usage. MFC after: 5 days
# 9696feeb	22-Sep-2014	John Baldwin <jhb@FreeBSD.org>	Add a new fo_fill_kinfo fileops method to add type-specific information to struct kinfo_file. - Move the various fill_*_info() methods out of kern_descrip.c and into the various file type implementations. - Rework the support for kinfo_ofile to generate a suitable kinfo_file object for each file and then convert that to a kinfo_ofile structure rather than keeping a second, different set of code that directly manipulates type-specific file information. - Remove the shm_path() and ksem_info() layering violations. Differential Revision: https://reviews.freebsd.org/D775 Reviewed by: kib, glebius (earlier version)
# 2d69d0dc	12-Sep-2014	John Baldwin <jhb@FreeBSD.org>	Fix various issues with invalid file operations: - Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(), and invfo_kqfilter() for use by file types that do not support the respective operations. Home-grown versions of invfo_poll() were universally broken (they returned an errno value, invfo_poll() uses poll_no_poll() to return an appropriate event mask). Home-grown ioctl routines also tended to return an incorrect errno (invfo_ioctl returns ENOTTY). - Use the invfo_() functions instead of local versions for unsupported file operations. - Reorder fileops members to match the order in the structure definition to make it easier to spot missing members. - Add several missing methods to linuxfileops used by the OFED shim layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most of these used invfo_(), but a dummy fo_stat() implementation was added.
# 5be725d7	29-Aug-2014	Andreas Tobler <andreast@FreeBSD.org>	Rename shm_dict_init to shm_init to fix a compiler warning. Reviewed by: jhb
# 610a2b3c	29-Aug-2014	John Baldwin <jhb@FreeBSD.org>	Use a unit number allocator to provide suitable st_dev and st_ino values for POSIX shared memory descriptors. The implementation is similar to that used for pipes. MFC after: 1 week
# 70978c93	12-Aug-2014	Konstantin Belousov <kib@FreeBSD.org>	If vm_page_grab() allocates a new page, the page is not inserted into page queue even when the allocation is not wired. It is responsibility of the vm_page_grab() caller to ensure that the page does not end on the vm_object queue but not on the pagedaemon queue, which would effectively create unpageable unwired page. In exec_map_first_page() and vm_imgact_hold_page(), activate the page immediately after unbusying it, to avoid leak. In the uiomove_object_page(), deactivate page before the object is unlocked. There is no leak, since the page is deactivated after uiomove_fromphys() finished. But allowing non-queued non-wired page in the unlocked object queue makes it impossible to assert that leak does not happen in other places. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 551a7895	01-Aug-2014	Rui Paulo <rpaulo@FreeBSD.org>	In the shm_open() and shm_unlink() syscalls, export the path to KTR. MFC after: 1 week
# 5d9b4508	28-Jul-2014	Konstantin Belousov <kib@FreeBSD.org>	For md(4), posix shm(3) and tmpfs(5), free swap space used by paged in dirty page, which is written by the process. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 4a144410	16-Mar-2014	Robert Watson <rwatson@FreeBSD.org>	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks
# 6f2b769c	15-Mar-2014	John-Mark Gurney <jmg@FreeBSD.org>	change td_retval into a union w/ off_t, with defines to mask the change... This eliminates a cast, and also forces td_retval (often 2 32-bit registers) to be aligned so that off_t's can be stored there on arches with strict alignment requirements like armeb (AVILA)... On i386, this doesn't change alignment, and on amd64 it doesn't either, as register_t is already 64bits... This will also prevent future breakage due to people adding additional fields to the struct... This gets AVILA booting a bit farther... Reviewed by: bde
# f55a7d30	24-Jan-2014	Robert Millan <rmh@FreeBSD.org>	Accept O_CLOEXEC in shm_open(). Reviewed by: jilles, jhb MFC after: 1 week
# 227aaa86	11-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Implement sendfile(2) for the posix shared memory segment file descriptor, in addition to the regular files. Requested by: alc Discussed with: emaste Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Approved by: re (hrs)
# edb572a3	09-Sep-2013	John Baldwin <jhb@FreeBSD.org>	Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use an address in the first 2GB of the process's address space. This flag should have the same semantics as the same flag on Linux. To facilitate this, add a new parameter to vm_map_find() that specifies an optional maximum virtual address. While here, fix several callers of vm_map_find() to use a VMFS_* constant for the findspace argument instead of TRUE and FALSE. Reviewed by: alc Approved by: re (kib)
# 5944de8e	22-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# 940cb0e2	21-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Implement read(2)/write(2) and necessary lseek(2) for posix shmfd. Add MAC framework entries for posix shm read and write. Do not allow implicit extension of the underlying memory segment past the limit set by ftruncate(2) by either of the syscalls. Read and write return short i/o, lseek(2) fails with EINVAL when resulting offset does not fit into the limit. Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# 41cf41fd	21-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Extract the general-purpose code from tmpfs to perform uiomove from the page queue of some vm object. Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# ca04d21d	15-Aug-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Make sendfile() a method in the struct fileops. Currently only vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix
# c7aebda8	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 5e3a17c0	24-Jul-2013	John Baldwin <jhb@FreeBSD.org>	Use VMFS_OPTIMAL_SPACE instead of VMFS_ALIGNED_SPACE in shm_map().
# b68cf25f	07-Apr-2013	Jilles Tjoelker <jilles@FreeBSD.org>	mqueue,ksem,shm: Fix race condition with setting UF_EXCLOSE. POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that has close-on-exec set. However, they do this incorrectly, leaving a window where a thread may fork and exec while the flag has not been set yet. The race is easily reproduced on a multicore system with one thread doing shm_open and close and another thread doing posix_spawnp and waitpid. Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies the code. MFC after: 1 week
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but a few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 2609222a	01-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrieved with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavily modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). 
Added convenient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
# e506e182	01-Apr-2012	John Baldwin <jhb@FreeBSD.org>	Export some more useful info about shared memory objects to userland via procstat(1) and fstat(1): - Change shm file descriptors to track the pathname they are associated with and add a shm_path() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and shm_path() to export the path, mode, and size of a shared memory object via struct kinfo_file. - Add a struct shmstat to the libprocstat(3) interface along with a procstat_get_shm_info() to export the mode and size of a shared memory object. - Change procstat to always print out the path for a given object if it is valid. - Teach fstat about shared memory objects and to display their path, mode, and size. MFC after: 2 weeks
# 2971897d	08-Jan-2012	Alan Cox <alc@FreeBSD.org>	Correct an error of omission in the implementation of the truncation operation on POSIX shared memory objects and tmpfs. Previously, neither of these modules correctly handled the case in which the new size of the object or file was not a multiple of the page size. Specifically, they did not handle partial page truncation of data stored on swap. As a result, stale data might later be returned to an application. Interestingly, a data inconsistency was less likely to occur under tmpfs than POSIX shared memory objects. The reason being that a different mistake by the tmpfs truncation operation helped avoid a data inconsistency. If the data was still resident in memory in a PG_CACHED page, then the tmpfs truncation operation would reactivate that page, zero the truncated portion, and leave the page pinned in memory. More precisely, the benevolent error was that the truncation operation didn't add the reactivated page to any of the paging queues, effectively pinning the page. This page would remain pinned until the file was destroyed or the page was read or written. With this change, the page is now added to the inactive queue. Discussed with: jhb Reviewed by: kib (an earlier version) MFC after: 3 weeks
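The partial-page case described in the entry above can be pictured with a small userspace analogue. This is a sketch under stated assumptions, not the kernel implementation: `zero_partial_page` and `SKETCH_PAGE_SIZE` are hypothetical names, and a plain byte buffer stands in for the object's last page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	SKETCH_PAGE_SIZE	4096	/* assumed page size, for illustration */

/*
 * Hypothetical userspace analogue: 'page' is a SKETCH_PAGE_SIZE-byte
 * buffer holding the last page of an object that is being truncated to
 * new_size bytes. Zero the bytes past the new end so that stale data
 * cannot be read back if the object later grows again.
 * Returns the number of bytes zeroed.
 */
static size_t
zero_partial_page(unsigned char *page, size_t new_size)
{
	size_t base = new_size % SKETCH_PAGE_SIZE;

	assert(page != NULL);
	if (base == 0)
		return (0);	/* new size is page-aligned; nothing to zero */
	memset(page + base, 0, SKETCH_PAGE_SIZE - base);
	return (SKETCH_PAGE_SIZE - base);
}
```

When the new size is a multiple of the page size, there is no partial page and nothing is zeroed, which is the only case the commit message says was handled correctly before the fix.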
# 338e7cf2	15-Dec-2011	John Baldwin <jhb@FreeBSD.org>	Use vm_mmap_to_errno(). Submitted by: kib
# fb680e16	14-Dec-2011	John Baldwin <jhb@FreeBSD.org>	Add a helper API to allow in-kernel code to map portions of shared memory objects created by shm_open(2) into the kernel's address space. This provides a convenient way for creating shared memory buffers between userland and the kernel without requiring custom character devices.
# dc874f98	30-Nov-2011	Konstantin Belousov <kib@FreeBSD.org>	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc
# 8451d0dd	16-Sep-2011	Kip Macy <kmacy@FreeBSD.org>	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# 9b6dd12e	02-Sep-2011	Robert Watson <rwatson@FreeBSD.org>	Correct several issues in the integration of POSIX shared memory objects and the new setmode and setowner fileops in FreeBSD 9.0: - Add new MAC Framework entry point mac_posixshm_check_create() to allow MAC policies to authorise shared memory use. Provide a stub policy and test policy templates. - Add missing Biba and MLS implementations of mac_posixshm_check_setmode() and mac_posixshm_check_setowner(). - Add 'accmode' argument to mac_posixshm_check_open() -- unlike the mac_posixsem_check_open() entry point it was modeled on, the access mode is required as shared memory access can be read-only as well as writable; this isn't true of POSIX semaphores. - Implement full range of POSIX shared memory entry points for Biba and MLS. Sponsored by: Google Inc. Obtained from: TrustedBSD Project Approved by: re (kib)
# 68889ed6	16-Aug-2011	Konstantin Belousov <kib@FreeBSD.org>	Fix build breakage. Initialize error variables explicitly for !MAC case. Pointy hat to: kib Approved by: re (bz)
# 9c00bb91	16-Aug-2011	Konstantin Belousov <kib@FreeBSD.org>	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)
# 12bc222e	30-Jun-2011	Jonathan Anderson <jonathan@FreeBSD.org>	Add some checks to ensure that Capsicum is behaving correctly, and add some more explicit comments about what's going on and what future maintainers need to do when e.g. adding a new operation to a sys_machdep.c. Approved by: mentor(rwatson), re(bz)
# 6bbee8e2	29-Jun-2011	Alan Cox <alc@FreeBSD.org>	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib
# 1fe80828	01-Apr-2011	Konstantin Belousov <kib@FreeBSD.org>	After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat. Requested by: jhb X-MFC-note: do not
# ef694c1a	02-Dec-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Replace pointer to "struct uidinfo" with pointer to "struct ucred" in "struct vm_object". This is required to make it possible to account for per-jail swap usage. Reviewed by: kib@ Tested by: pho@ Sponsored by: FreeBSD Foundation
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# c8fa8709	02-Jun-2010	Alan Cox <alc@FreeBSD.org>	Minimize the use of the page queues lock for synchronizing access to the page's dirty field. With the exception of one case, access to this field is now synchronized by the object lock.
# 510ea843	28-Mar-2010	Ed Schouten <ed@FreeBSD.org>	Rename st_timespec fields to st_tim for POSIX 2008 compliance. A nice thing about POSIX 2008 is that it finally standardizes a way to obtain file access/modification/change times in sub-second precision, namely using struct timespec, which we have already had for a very long time. Unfortunately POSIX uses different names. This commit adds compatibility macros, so existing code should still build properly. Also change all source code in the kernel to work without any of the compatibility macros. This makes it all less ambiguous. I am also renaming st_birthtime to st_birthtim, even though it was a local extension anyway. It seems Cygwin also has a st_birthtim.
# 3364c323	23-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Implement global and per-uid accounting of the anonymous memory. Add rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation is performed by kernel, e.g. md(4). Several syscalls, among them fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 3c33df62	02-Jun-2009	Alan Cox <alc@FreeBSD.org>	Correct a boundary case error in the management of a page's dirty bits by shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of a shared memory object or a file is truncated such that the length modulo the page size is between 1 and 511, then all of the page's dirty bits were cleared. Now, a dirty bit is cleared only if the corresponding block is truncated in its entirety.
# 6ee7dd87	01-Dec-2008	Alexander Kabaev <kan@FreeBSD.org>	Shared memory objects that have size which is not necessarily equal to exact multiple of system page size should still be allowed to be mapped in their entirety to match the regular vnode backed file behavior. Reported by: ed Reviewed by: jhb
# 15bc6b2b	28-Oct-2008	Edward Tomasz Napierala <trasz@FreeBSD.org>	Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 6bc1e9cd	26-Jun-2008	John Baldwin <jhb@FreeBSD.org>	Rework the lifetime management of the kernel implementation of POSIX semaphores. Specifically, semaphores are now represented as new file descriptor type that is set to close on exec. This removes the need for all of the manual process reference counting (and fork, exec, and exit event handlers) as the normal file descriptor operations handle all of that for us nicely. It is also suggested as one possible implementation in the spec and at least one other OS (OS X) uses this approach. Some bugs that were fixed as a result include: - References to a named semaphore whose name is removed still work after the sem_unlink() operation. Prior to this patch, if a semaphore's name was removed, valid handles from sem_open() would get EINVAL errors from sem_getvalue(), sem_post(), etc. This fixes that. - Unnamed semaphores created with sem_init() were not cleaned up when a process exited or exec'd. They were only cleaned up if the process did an explicit sem_destroy(). This could result in a leak of semaphore objects that could never be cleaned up. - On the other hand, if another process guessed the id (kernel pointer to 'struct ksem' of an unnamed semaphore (created via sem_init)) and had write access to the semaphore based on UID/GID checks, then that other process could manipulate the semaphore via sem_destroy(), sem_post(), sem_wait(), etc. - As part of the permission check (UID/GID), the umask of the process creating the semaphore was not honored. Thus if your umask denied group read/write access but the explicit mode in the sem_init() call allowed it, the semaphore would be readable/writable by other users in the same group, for example. This includes access via the previous bug. - If the module refused to unload because there were active semaphores, then it might have deregistered one or more of the semaphore system calls before it noticed that there was a problem. I'm not sure if this actually happened as the order that modules are discovered by the kernel linker depends on how the actual .ko file is linked. One can make the order deterministic by using a single module with a mod_event handler that explicitly registers syscalls (and deregisters during unload after any checks). This also fixes a race where even if the sem_module unloaded first it would have destroyed locks that the syscalls might be trying to access if they are still executing when they are unloaded. XXX: By the way, deregistering system calls doesn't do any blocking to drain any threads from the calls. - Some minor fixes to errno values on error. For example, sem_init() isn't documented to return ENFILE or EMFILE if we run out of semaphores the way that sem_open() can. Instead, it should return ENOSPC in that case. Other changes: - Kernel semaphores now use a hash table to manage the namespace of named semaphores nearly in a similar fashion to the POSIX shared memory object file descriptors. Kernel semaphores can now also have names longer than 14 chars (up to MAXPATHLEN) and can include subdirectories in their pathname. - The UID/GID permission checks for access to a named semaphore are now done via vaccess() rather than a home-rolled set of checks. - Now that kernel semaphores have an associated file object, the various MAC checks for POSIX semaphores accept both a file credential and an active credential. There is also a new posixsem_check_stat() since it is possible to fstat() a semaphore file descriptor. - A small set of regression tests (using the ksem API directly) is present in src/tools/regression/posixsem. Reported by: kris (1) Tested by: kris Reviewed by: rwatson (lightly) MFC after: 1 month
# e384d8a8	13-Apr-2008	Alan Cox <alc@FreeBSD.org>	Initialize the vm object's flags to include OBJ_NOSPLIT, just like the vm objects that are used by System V shared memory segments.
# fb73a5ab	06-Feb-2008	Alan Cox <alc@FreeBSD.org>	Change shm_dotruncate() so that it correctly handles cached pages that span the end of the object. (This change is analogous to revision 1.237 of vm/vnode_pager.c.) Discussed with: jhb
# 8ffbe155	16-Jan-2008	John Baldwin <jhb@FreeBSD.org>	Add a set of regression tests for the POSIX shm API (shm_open(2) and shm_unlink(2)).
# 8e38aeff	08-Jan-2008	John Baldwin <jhb@FreeBSD.org>	Add a new file descriptor type for IPC shared memory objects and use it to implement shm_open(2) and shm_unlink(2) in the kernel: - Each shared memory file descriptor is associated with a swap-backed vm object which provides the backing store. Each descriptor starts off with a size of zero, but the size can be altered via ftruncate(2). The shared memory file descriptors also support fstat(2). read(2), write(2), ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared memory file descriptors. - shm_open(2) and shm_unlink(2) are now implemented as system calls that manage shared memory file descriptors. The virtual namespace that maps pathnames to shared memory file descriptors is implemented as a hash table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash of the pathname. - As an extension, the constant 'SHM_ANON' may be specified in place of the path argument to shm_open(2). In this case, an unnamed shared memory file descriptor will be created similar to the IPC_PRIVATE key for shmget(2). Note that the shared memory object can still be shared among processes by sharing the file descriptor via fork(2) or sendmsg(2), but it is unnamed. This effectively serves to implement the getmemfd() idea bandied about the lists several times over the years. - The backing store for shared memory file descriptors is garbage collected when it is not referenced by any open file descriptors or the shm_open(2) virtual namespace. Submitted by: dillon, peter (previous versions) Submitted by: rwatson (I based this on his version) Reviewed by: alc (suggested converting getmemfd() to shm_open())