History log of /freebsd-current/sys/vm/vm_mmap.c
Revision Date Author Comments
# 7893419d 04-Dec-2023 Brooks Davis <brooks@FreeBSD.org>

Remove never implemented sbrk and sstk syscalls

Both system calls were stubs returning EOPNOTSUPP and libc did not
provide _ or __sys_ prefixed symbols. The actual implementation of
sbrk(2) is on top of the undocumented break(2) system call.

Technically this is a change in ABI, but no non-contrived program ever
called these syscalls.

Reviewed by: kib, emaste
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D42872


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# f3e11927 14-Aug-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vm: Allow MAP_32BIT for all architectures

Reviewed by: alc, kib, markj
Differential revision: https://reviews.freebsd.org/D41435


# 0ddd32b6 14-Aug-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vm: MAP_32BIT_MAX_ADDR defined in sys/mman.h

Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D41434


# 37e5d49e 03-Aug-2023 Alan Cox <alc@FreeBSD.org>

vm: Fix address hints of 0 with MAP_32BIT

Also, rename min_addr to default_addr, which better reflects what it
represents. The min_addr is not a minimum address in the same way that
max_addr is actually a maximum address that can be allocated. For
example, a non-zero hint can be less than min_addr and be allocated.

Reported by: dchagin
Reviewed by: dchagin, kib, markj
Fixes: d8e6f4946cec0 "vm: Fix anonymous memory clustering under ASLR"
Differential Revision: https://reviews.freebsd.org/D41397


# 9b65fa69 29-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

linuxolator: implement Linux' PROT_GROWSDOWN

From the Linux man page for mprotect(2):
PROT_GROWSDOWN
Apply the protection mode down to the beginning of a mapping
that grows downward (which should be a stack segment or a
segment mapped with the MAP_GROWSDOWN flag set).

Reported by: dchagin
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41099


# 5ec2d94a 25-Jul-2023 Alan Cox <alc@FreeBSD.org>

vm_mmap_object: Update the spelling of true/false

Since fitit is already a bool, use true/false instead of TRUE/FALSE.

MFC after: 2 weeks


# d8e6f494 22-Jun-2023 Alan Cox <alc@FreeBSD.org>

vm: Fix anonymous memory clustering under ASLR

By default, our ASLR implementation is supposed to cluster anonymous
memory allocations, unless the application's mmap(..., MAP_ANON, ...)
call included a non-zero address hint. Unfortunately, clustering
never occurred because kern_mmap() always replaced the given address
hint when it was zero. So, the ASLR implementation always believed
that a non-zero hint had been provided and randomized the mapping's
location in the address space. To fix this problem, I'm pushing down
the point at which we convert a hint of zero to the minimum allocatable
address from kern_mmap() to vm_map_find_min().

Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40743


# 0cb2610e 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm: Remove handling for OBJT_DEFAULT objects

Now that OBJT_DEFAULT objects can't be instantiated, we can simplify
checks of the form object->type == OBJT_DEFAULT || (object->flags &
OBJ_SWAP) != 0. No functional change intended.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35788


# eee9aab9 13-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_mmap: Remove obsolete code and comments from vm_mmap()

In preparation for removing OBJT_DEFAULT, eliminate some stale/unhelpful
comments from vm_mmap(), and remove an unused case. In particular, the
remaining callers of vm_mmap() in the tree do not specify OBJT_DEFAULT.

It's much more common to use vm_map_find() to map an object into user
memory, so rather than adjusting vm_mmap() to handle OBJT_SWAP objects,
let's further discourage its use and simply remove OBJT_DEFAULT
handling.

Reviewed by: dougm, alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35778


# e123264e 19-Jun-2022 Mark Johnston <markj@FreeBSD.org>

vm: Fix racy checks for swap objects

Commit 4b8365d752ef introduced the ability to dynamically register
VM object types, for use by tmpfs, which creates swap-backed objects.
As a part of this, checks for such objects changed from

object->type == OBJT_DEFAULT || object->type == OBJT_SWAP

to

object->type == OBJT_DEFAULT || (object->flags & OBJ_SWAP) != 0

In particular, objects of type OBJT_DEFAULT do not have OBJ_SWAP set;
the swap pager sets this flag when converting from OBJT_DEFAULT to
OBJT_SWAP.

A few of these checks are done without the object lock held. It turns
out that this can result in false negatives since the swap pager
converts objects like so:

object->type = OBJT_SWAP;
object->flags |= OBJ_SWAP;

Fix the problem by adding explicit tests for OBJT_SWAP objects in
unlocked checks.
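
As a rough illustration of the window described above, the following
sketch models an object caught between the two stores. The struct, the
enum values and the flag value are stand-ins invented for this example,
not the kernel's definitions, and a real unlocked check would also need
appropriate atomic loads.

```c
#include <stdbool.h>

/* Illustrative stand-ins only; not the kernel's types or values. */
enum ex_objtype { EX_OBJT_DEFAULT, EX_OBJT_SWAP, EX_OBJT_VNODE };
#define	EX_OBJ_SWAP	0x0200

struct ex_object {
	enum ex_objtype	type;
	unsigned int	flags;
};

/*
 * The check in its pre-fix form.  It returns false for an object whose
 * type has already been set to EX_OBJT_SWAP but whose EX_OBJ_SWAP flag
 * has not yet been stored.
 */
static bool
ex_is_swap_racy(const struct ex_object *obj)
{
	return (obj->type == EX_OBJT_DEFAULT ||
	    (obj->flags & EX_OBJ_SWAP) != 0);
}

/* The unlocked check with the explicit type test added. */
static bool
ex_is_swap_unlocked(const struct ex_object *obj)
{
	return (obj->type == EX_OBJT_DEFAULT ||
	    obj->type == EX_OBJT_SWAP ||
	    (obj->flags & EX_OBJ_SWAP) != 0);
}
```

An object observed mid-conversion (type already EX_OBJT_SWAP, flag not
yet set) is missed by the first helper and caught by the second.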

PR: 258932
Fixes: 4b8365d752ef ("Add OBJT_SWAP_TMPFS pager")
Reported by: bdrewery
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35470


# b1ad6a90 28-Mar-2022 Brooks Davis <brooks@FreeBSD.org>

syscallarg_t: Add a type for system call arguments

This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from: CheriBSD

Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780


# 0910a41e 12-Jan-2022 Brooks Davis <brooks@FreeBSD.org>

Revert "syscallarg_t: Add a type for system call arguments"

Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.

This reverts commit 3889fb8af0b611e3126dc250ebffb01805152104.
This reverts commit 1544e0f5d1f1e3b8c10a64cb899a936976ca7ea4.


# 1544e0f5 12-Jan-2022 Brooks Davis <brooks@FreeBSD.org>

syscallarg_t: Add a type for system call arguments

This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from: CheriBSD

Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780


# 01ce7fca 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

ommap: fix signed len and pos arguments

4.3 BSD's mmap took an int len and long pos. Reject negative lengths
and in freebsd32 sign-extend pos correctly rather than mis-handling
negative positions as large positive ones.
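
The difference between the two widenings can be sketched as follows.
The helper names are hypothetical and this is not the freebsd32 code
itself; it only shows why a negative 32-bit position must be
sign-extended before it is treated as a 64-bit offset.

```c
#include <stdint.h>

/* Hypothetical helper: treat the 32-bit value as signed, then widen. */
static int64_t
ex_pos_sign_extend(uint32_t pos32)
{
	return ((int64_t)(int32_t)pos32);
}

/* Hypothetical helper: the mis-handling, widening as unsigned. */
static int64_t
ex_pos_zero_extend(uint32_t pos32)
{
	return ((int64_t)pos32);
}

/* A signed int length is only acceptable when it is non-negative. */
static int
ex_len_valid(int len)
{
	return (len >= 0);
}
```

With the bit pattern 0xfffff000, the first helper yields -4096 while
the second yields a position just under 4G.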

Reviewed by: kib


# 4b8365d7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add OBJT_SWAP_TMPFS pager

This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 7a1591c1 22-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

Rename kern_mmap_req to kern_mmap

Replace all uses of kern_mmap with kern_mmap_req, remove the old
kern_mmap, and rename kern_mmap_req to kern_mmap.

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from: CheriBSD
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28292


# 0659df6f 12-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_map_protect: allow to set prot and max_prot in one go.

This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome. E.g. you can get an error that is not
possible if the operation is performed atomically.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by: brooks, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28117


# d301b358 09-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Support for userspace non-transparent superpages (largepages).

Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.

Only amd64 for now, both 2M and 1G superpages can be requested, the
latter requires a CPU feature.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# e8f77c20 09-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Prepare to handle non-trivial errors from vm_map_delete().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# 67a659d2 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# 847ab36b 02-Sep-2020 Mark Johnston <markj@FreeBSD.org>

Include the psind in data returned by mincore(2).

Currently we use a single bit to indicate whether the virtual page is
part of a superpage. To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.
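
A consumer might decode the page size index along these lines. The
EX_-prefixed constants are assumptions made for this sketch: the text
above only says that the field is two bits wide and that
MINCORE_PSIND(1) equals the old MINCORE_SUPER, so the bit positions
used here are illustrative and <sys/mman.h> remains the authority.

```c
/*
 * Illustrative layout: a two-bit field whose lowest bit plays the role
 * of the old single "superpage" bit.  These values are assumptions.
 */
#define	EX_MINCORE_SUPER	0x60
#define	EX_MINCORE_PSIND(i)	(((i) << 5) & EX_MINCORE_SUPER)

/* Extract the page size index from one byte of the mincore(2) vector. */
static int
ex_mincore_psind(unsigned char vec_byte)
{
	return ((vec_byte & EX_MINCORE_SUPER) >> 5);
}
```

The returned index would then be used to look up the mapping size in
the array filled in by getpagesizes(3).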

Reviewed by: alc, kib
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26238


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# 52c81be1 20-Jun-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly. While some of the flag values match,
most don't.

PR: kern/230160
Reported by: markj
Reviewed by: markj
Discussed with: brooks, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25272


# 0f1e6ec5 18-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Add a helper function for validating VA ranges.

Functions which take untrusted user ranges must validate against the
bounds of the map, and also check for wraparound. Instead of having the
same logic duplicated in a number of places, add a function to check.
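
The kind of check being factored out can be sketched as below. The
function name and signature are hypothetical and are not taken from the
commit; the point is only that the wraparound test and the bounds tests
belong together.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept [start, end) only if it did not wrap
 * and lies within the map bounds [min, max].
 */
static bool
ex_range_valid(uintptr_t min, uintptr_t max, uintptr_t start,
    uintptr_t end)
{
	return (end >= start && start >= min && end <= max);
}
```

A caller that computes end as start + len gets the wraparound case
rejected by the first comparison, since the sum wraps to a value below
start.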

Reviewed by: dougm, kib
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25328


# 4d13f784 03-Jun-2020 Ed Maste <emaste@FreeBSD.org>

Correct terminology in vm.imply_prot_max sysctl description

As with r361769 (man page), PROT_* are properly called protections, not
permissions.

MFC after: 1 week
MFC with: r361769
Sponsored by: The FreeBSD Foundation


# d718de81 04-Mar-2020 Brooks Davis <brooks@FreeBSD.org>

Introduce kern_mmap_req().

This presents an extensible interface to the generic mmap(2)
implementation via a struct pointer intended to use a designated
initializer or compound literal. We take advantage of the mandatory
zeroing of fields not listed in the initializer.
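
The calling convention can be sketched as follows. The struct layout,
field names and stub function are hypothetical and much smaller than
the real request structure; the sketch only shows a compound literal
with designated initializers and the zeroing of omitted fields.

```c
#include <stddef.h>

/* Hypothetical request structure; not the kernel's definition. */
struct ex_mmap_req {
	void		*addr;
	size_t		 len;
	int		 prot;
	int		 flags;
	int		 fd;
	long long	 pos;
};

/*
 * Stub standing in for the implementation: returns 0 when the fields
 * the caller below leaves out were zeroed, -1 otherwise.
 */
static int
ex_kern_mmap_req(const struct ex_mmap_req *req)
{
	if (req->addr != NULL || req->flags != 0 || req->pos != 0)
		return (-1);
	return (0);
}

/* Only the relevant fields are named; the rest are zero. */
static int
ex_map_one_page(void)
{
	return (ex_kern_mmap_req(&(struct ex_mmap_req){
		.len = 4096,
		.prot = 0x3,
		.fd = -1,
	}));
}
```

Adding a field to the structure later does not require touching
callers that do not care about it, which is the extensibility argument
made above.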

Remove kern_mmap_fpcheck() and use kern_mmap_req().

The motivation for this change is a desire to keep the core
implementation from growing an ever-increasing number of arguments
that must be specified in the correct order for the lowest-level
implementations. In CheriBSD we have already added two more arguments.

Reviewed by: kib
Discussed with: kevans
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D23164


# acb8858f 26-Feb-2020 Ed Maste <emaste@FreeBSD.org>

Return ENOTSUP for mmap/mprotect if prot not subset of prot_max

From POSIX,

[ENOTSUP]
The implementation does not support the combination of accesses
requested in the prot argument.

This fits the case that prot contains permissions which are not a subset
of prot_max.
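
The subset test can be sketched as a single mask operation. The helper
name is hypothetical and the real check sits inside the mmap/mprotect
paths; this only shows the condition that maps to ENOTSUP.

```c
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical helper: prot must not contain bits absent from max_prot. */
static int
ex_check_prot(int prot, int max_prot)
{
	if ((prot & ~max_prot) != 0)
		return (ENOTSUP);
	return (0);
}
```

For example, requesting PROT_READ | PROT_WRITE against a maximum of
PROT_READ fails, while requesting PROT_READ against a maximum of
PROT_READ | PROT_WRITE succeeds.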

Reviewed by: brooks, cem
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23843


# 3379d2f9 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: use new capsicum helpers


# 23ed568c 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: remove no longer needed atomic_load_ptr casts


# 643656cf 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace VOP_MARKATIME with VOP_MMAPPED

The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23422


# 2180f6c6 04-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

kern_mmap: restore character deleted in transit

Pointy hat to: kevans
X-MFC-With: r356359


# 18348a23 04-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

kern_mmap: add a variant that allows caller to inspect fp

Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its reference.

This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.

The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.

If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22977


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650


# f2410510 29-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Avoid acquiring the object lock if color is already set. It can not be
unset until the object is recycled so this check is stable. Now that we
can acquire the ref without a lock it is not necessary to group these
operations and we can avoid it entirely in many cases.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22565


# 7cdcf863 13-Nov-2019 Doug Moore <dougm@FreeBSD.org>

Define wrapper functions vm_map_entry_{succ,pred} to act as wrappers
around entry->{next,prev} when those are used for ordered list
traversal, and use those wrapper functions everywhere. Where the next
field is used for maintaining a stack of deferred operations, #define
defer_next to make that different usage clearer, and then use the
'right' pointer instead of 'next' for that purpose.

Approved by: markj
Tested by: pho (as part of a larger patch)
Differential Revision: https://reviews.freebsd.org/D22347


# 01cef4ca 16-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking from pmap_mincore().

After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823


# d0c9294b 16-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Correct the range boundaries used by kern_mincore().

Reported by: alc
Sponsored by: Netflix


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# fe7bcbaf 03-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

vm pager: writemapping accounting for OBJT_SWAP

Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456


# 5dc7e31a 02-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795


# 37306951 02-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Use traditional 'p' local to designate td->td_proc in kern_mmap.

Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D20795


# 74a1b66c 20-Jun-2019 Brooks Davis <brooks@FreeBSD.org>

Extend mmap/mprotect API to specify the max page protections.

A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions. If
present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.
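
One way the ifdef fallback might look in portable code is sketched
below. The helper name is hypothetical; only PROT_MAX() and the PROT_*
constants come from the system headers.

```c
#include <sys/mman.h>

/* On systems without PROT_MAX(), let it contribute no bits at all. */
#ifndef PROT_MAX
#define	PROT_MAX(prot)	0
#endif

/*
 * Hypothetical helper: build a protection argument that requests prot
 * now and, where supported, caps later changes at max_prot.
 */
static int
ex_prot_with_cap(int prot, int max_prot)
{
	return (prot | PROT_MAX(max_prot));
}
```

The result would be passed as the prot argument of mmap(2) or
mprotect(2). Where PROT_MAX() is unavailable the cap is silently
dropped and the call behaves as it always did.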

This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete. This complements W^X protections allowing more
precise control by the programmer.

This change alters mprotect argument checking and returns an error when
unhandled protection flags are set. This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1. PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation. A
final version of this is expected to provide per-binary and per-process
opt-in/out options and this sysctl will go away in its current form.
As such it is undocumented.

Reviewed by: emaste, kib (prior version), markj
Additional suggestions from: alc
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18880


# f8c8b2e8 10-Jun-2019 Doug Moore <dougm@FreeBSD.org>

r348879 introduced a wrong-way comparison that broke mmap.
This change rights that comparison.

Reported by: pho
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20595


# 77555b84 10-Jun-2019 Doug Moore <dougm@FreeBSD.org>

Change the check for 'size' wrapping around to zero in kern_mmap to account
for both the lower and upper bound modifications. Change the error returned
to ENOMEM. Rename the parameter size to len and make size a local variable
that stores the value of len after it has been modified.

This addresses concerns expressed by Bruce Evans after r348843.

Reported by: brde@optusnet.com.au
Reviewed by: kib, markj (mentors)
MFC after: 3 days
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20592


# 97220a27 09-Jun-2019 Doug Moore <dougm@FreeBSD.org>

There are times when a len==0 parameter to mmap is okay. But on a
32-bit machine, a len parameter just a few bytes short of 4G, rounded
up to a page boundary and hitting zero then, is not okay. Return
failure in that case.
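
The arithmetic behind this can be sketched with 32-bit unsigned values.
The page size of 4096 and the helper names are assumptions for the
example; the kernel uses its own round_page() and size types.

```c
#include <stdbool.h>
#include <stdint.h>

#define	EX_PAGE_SIZE	((uint32_t)4096)	/* assumed page size */
#define	EX_PAGE_MASK	(EX_PAGE_SIZE - 1)

/* Round up to a page boundary; wraps to 0 for lengths just under 4G. */
static uint32_t
ex_round_page32(uint32_t len)
{
	return ((len + EX_PAGE_MASK) & ~EX_PAGE_MASK);
}

/* A zero length may be fine; a non-zero length that rounds to 0 is not. */
static bool
ex_len_ok(uint32_t len)
{
	return (len == 0 || ex_round_page32(len) != 0);
}
```

For len = 0xfffff001, adding the page mask overflows 32 bits and the
rounded size becomes 0, which is the case that must be rejected.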

Reported by: pho
Reviewed by: alc, kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20580


# 8cd6a80d 13-May-2019 Mark Johnston <markj@FreeBSD.org>

Restore the pre-r347532 behaviour of ignoring wiring failures in mmap().

The error handling added in r347532 is not right when mapping vnodes
and will be fixed separately.

Reported by: syzbot+1d2cc393bd6c88a548be@syzkaller.appspotmail.com
MFC with: r347532


# 54a3a114 13-May-2019 Mark Johnston <markj@FreeBSD.org>

Provide separate accounting for user-wired pages.

Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
short of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
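
The counting scheme can be modelled roughly as below. This is a toy
model with hypothetical names, no locking, and ETXTBSY assumed as the
error for a conflicting request; it is not the VOP implementation.

```c
#include <errno.h>

/* Toy vnode: > 0 means writers, < 0 means text references. */
struct ex_vnode {
	int	writecount;
};

/* Register a writer (inc > 0); refused while text references exist. */
static int
ex_add_writecount(struct ex_vnode *vp, int inc)
{
	if (vp->writecount < 0)
		return (ETXTBSY);
	vp->writecount += inc;
	return (0);
}

/* Take a text reference; refused while writers exist. */
static int
ex_set_text(struct ex_vnode *vp)
{
	if (vp->writecount > 0)
		return (ETXTBSY);
	vp->writecount--;
	return (0);
}

/* Drop a text reference, as done on map entry removal. */
static void
ex_unset_text(struct ex_vnode *vp)
{
	if (vp->writecount < 0)
		vp->writecount++;
}
```

Because the check and the update happen in one step, the two uses of
the counter exclude each other without a separate flag.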

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 5dddee2d 08-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

i386: honor kern.elf32.read_exec for ommap(2) and break(2), as already
done on amd64.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# a7f67fac 08-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Normalize the declaration of i386_read_exec variable.

It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.

Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.

Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3fbc2e00 07-Jan-2019 Konstantin Belousov <kib@FreeBSD.org>

Add a tunable which changes mincore(2) algorithm to only report data
from the local mapping.

Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161

Reviewed by: markj
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18764


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# d48719bd 04-Dec-2018 Brooks Davis <brooks@FreeBSD.org>

Normalize COMPAT_43 syscall declarations.

Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15816


# 2554f86a 17-Sep-2018 Mateusz Guzik <mjg@FreeBSD.org>

vm: stop taking proc lock in mmap to satisfy racct if it is disabled

Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.

This eliminates second most common place of taking the lock while pkg building.

While here don't take the lock in mlockall either.

Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17210


# 6e1d2cf6 31-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

For compat32, emulate the same wraparound check as occurs on the real
ILP32 system.

Reported by and discussed with: asomers
PR: 230162
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16525


# 3e7cb27c 04-Jun-2018 Alan Cox <alc@FreeBSD.org>

Use a single, consistent approach to returning success versus failure in
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by: kib, markj
MFC after: 7 days


# 633d3b1c 01-Jun-2018 Konstantin Belousov <kib@FreeBSD.org>

Only check for MAP_32BIT when available.

Reported by: mmacy
Sponsored by: The FreeBSD Foundation
MFC after: 10 days


# 60221a57 01-Jun-2018 Alan Cox <alc@FreeBSD.org>

Only a small subset of mmap(2)'s flags should be used in combination with
the flag MAP_GUARD. Rather than enumerating the flags that are not
allowed, enumerate the flags that are allowed. The list of allowed flags
is much shorter and less likely to change. (As an aside, one of the
previously enumerated flags, MAP_PREFAULT, was not even a legal flag for
mmap(2). However, because of an earlier check within kern_mmap(), this
misuse of MAP_PREFAULT was harmless.)
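
The allow-list approach can be sketched as follows. The flag values and the
exact set of allowed flags below are assumptions made for illustration; they
are not the values or the list used by kern_mmap():

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, for illustration only. */
#define	DEMO_MAP_SHARED	0x0001
#define	DEMO_MAP_FIXED	0x0010
#define	DEMO_MAP_EXCL	0x0020
#define	DEMO_MAP_GUARD	0x0040

/*
 * Enumerate what is allowed together with the guard flag and reject
 * everything else, instead of listing every forbidden flag.
 */
static int
demo_check_guard_flags(int flags)
{
	const int allowed = DEMO_MAP_GUARD | DEMO_MAP_FIXED | DEMO_MAP_EXCL;

	if ((flags & DEMO_MAP_GUARD) != 0 && (flags & ~allowed) != 0)
		return (EINVAL);
	return (0);
}
```

A flag added later is rejected automatically in combination with the guard
flag unless it is deliberately added to the allowed mask.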

Reviewed by: kib
MFC after: 10 days


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# e958ad4c 12-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273


# 1c5196c3 19-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Assign map->header values to avoid boundary checks.

In several places, entry start and end field are checked, after
excluding the possibility that the entry is map->header. By assigning
max and min values to the start and end fields of map->header in
vm_map_init, the explicit map->header checks become unnecessary.
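
The sentinel idea can be sketched with a toy circular list. The types and
function names below are hypothetical simplifications, not the kernel's
vm_map structures:

```c
#include <assert.h>
#include <stdint.h>

struct demo_entry {
	struct demo_entry *prev, *next;
	uint64_t start, end;		/* covers [start, end) */
};

struct demo_map {
	struct demo_entry header;	/* sentinel, not a real mapping */
};

static void
demo_map_init(struct demo_map *m, uint64_t min, uint64_t max)
{
	m->header.prev = m->header.next = &m->header;
	/* Sentinel values: the header "starts" at max and "ends" at min. */
	m->header.start = max;
	m->header.end = min;
}

/* Append at the tail; the caller keeps entries sorted by address. */
static void
demo_map_append(struct demo_map *m, struct demo_entry *e)
{
	e->prev = m->header.prev;
	e->next = &m->header;
	m->header.prev->next = e;
	m->header.prev = e;
}

/*
 * Count entries starting below 'limit'.  The loop needs no explicit
 * "e != &m->header" test: header.start == max stops it.
 */
static int
demo_map_count_below(const struct demo_map *m, uint64_t limit)
{
	int n = 0;

	assert(limit <= m->header.start);
	for (const struct demo_entry *e = m->header.next; e->start < limit;
	    e = e->next)
		n++;
	return (n);
}
```

The walk terminates at the header because its start field compares as not
below any valid limit, which is the boundary check being avoided.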

Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, kib, markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13735


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# bd0e1beb 07-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Correct the type of foff.

No functional change intended.

Github PR: 124
Submitted by: Wuyang Chung <wuyang.m.chung@outlook.com>
MFC after: 1 week


# 6a97a3f7 27-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Treat the addr argument for mmap(2) request without MAP_FIXED flag as
a hint.

Right now, for non-fixed mmap(2) calls, addr is de-facto interpreted
as the absolute minimal address of the range where the mapping is
created. The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS]. This is too restrictive: the mmap(2) call might
unduly fail if there are no free addresses above addr but a lot of
usable space below it.

Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before. If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.
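
The two-pass search can be sketched on a toy address space modelled as an
array of page slots. All names here are hypothetical, and 'floor' merely
stands in for the lower limit described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Find 'want' consecutive free slots at or above 'from'; -1 if none. */
static long
demo_find_free_run(const bool *used, size_t npages, size_t from, size_t want)
{
	size_t run = 0;

	assert(want > 0);
	for (size_t i = from; i < npages; i++) {
		run = used[i] ? 0 : run + 1;
		if (run == want)
			return ((long)(i + 1 - want));
	}
	return (-1);
}

/*
 * First pass: search at or above the hint, as before.  Second pass:
 * retry from the lower limit, but only if it is below the hint.
 */
static long
demo_alloc_with_hint(const bool *used, size_t npages, size_t hint,
    size_t floor, size_t want)
{
	long at;

	at = demo_find_free_run(used, npages, hint, want);
	if (at < 0 && floor < hint)
		at = demo_find_free_run(used, npages, floor, want);
	return (at);
}
```

When everything above the hint is occupied, the first pass fails and the
second pass finds the free run below the hint instead of failing the request.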

One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr. Under some
configuration conditions, libthr tries to hint the kernel to reuse the
main thread stack grow area for the new stacks. This cannot work by
design now after grow area is converted to stack, and there is no
unallocated VA above the main stack. Interpreting requested stack
base address as the hint provides compatibility with old libthr and
with (mis-)configured current libthr.

Reviewed by: alc
Tested by: dim (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 19bd0d9c 24-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Implement address space guards.

Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range. It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).

Use guards to reimplement stack grow code. Explicitly track stack
grow area with the guard, including the stack guard page. On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion. Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().

As a result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.

Enable stack guard page by default.

Reviewed by: alc, markj
Man page update reviewed by: alc, bjk, emaste, markj, pho
Tested by: pho, Qualys
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11306 (man pages)


# 46dc8e9d 30-Mar-2017 Dmitry Chagin <dchagin@FreeBSD.org>

Add kern_mincore() helper for mincore() syscall.

Suggested by: kib@
Reviewed by: kib@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D10143


# d1780e8d 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use atop() instead of OFF_TO_IDX() for conversion of addresses or
address offsets, as intended.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 496ab053 13-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Rework r313352.

Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535


# 04e89ffb 12-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers.

Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 1c2ad3e9 11-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.

This makes the code pass the whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits. The change provides a temporary fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.

PR: 216976
Reported and tested by: ngie
Sponsored by: The FreeBSD Foundation


# 69cdfcef 06-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by: ed, dchagin, kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9378


# 736ff8c3 24-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

hwpmc: partially depessimize munmap handling if the module is not loaded

HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
common (module not loaded) case.

In particular this avoids permission checks + lock downgrade
singlethreaded and in cases where an executable mapping is found the pmc
sx lock is no longer bounced.

Note this is a band aid.

MFC after: 1 week


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# 0df42647 10-Jul-2016 Robert Watson <rwatson@FreeBSD.org>

When mmap(2) is used with a vnode, capture vnode attributes in the
audit trail. This was not required for Common Criteria auditing
(which requires only that the intent to read or write be audited
at the time of open(2)), but is useful for contemporary live
analysis and forensics.

MFC after: 3 days
Sponsored by: DARPA, AFRL


# 51d1f690 10-Jul-2016 Robert Watson <rwatson@FreeBSD.org>

Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after: 3 days
Sponsored by: DARPA, AFRL


# 010ba384 05-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Add a local variable initialization needed in the OBJT_DEFAULT case.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D2992


# cd336bad 02-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vm: don't lock proc around accesses to vm_{t,d}addr and RLIMIT_DATA in sys_mmap

vm_{t,d}addr are constant and we can use thread's copy of resource limits


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thread's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 7077c426 04-Jun-2015 John Baldwin <jhb@FreeBSD.org>

Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions. A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings. For mappings using a file descriptor, the
descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset. The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional. If it is not set, then fo_mmap() will
fail with ENODEV. A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).
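
The optional-hook dispatch described above might be sketched like this. The
structures and the hook signature are heavily simplified assumptions; the real
fo_mmap() hook takes more arguments (map, address, protections, offset, thread
and so on):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_file;

/* Simplified, hypothetical hook signature. */
typedef int demo_fo_mmap_t(struct demo_file *fp, size_t len, int prot,
    int flags);

struct demo_fileops {
	demo_fo_mmap_t *fo_mmap;	/* optional; NULL if unsupported */
};

struct demo_file {
	const struct demo_fileops *f_ops;
};

/* Dispatch to the type-specific hook, or fail with ENODEV if unset. */
static int
demo_fo_mmap(struct demo_file *fp, size_t len, int prot, int flags)
{
	assert(fp != NULL && fp->f_ops != NULL);
	if (fp->f_ops->fo_mmap == NULL)
		return (ENODEV);
	return (fp->f_ops->fo_mmap(fp, len, prot, flags));
}

/* A stand-in for a file type that supports mapping. */
static int
demo_vn_mmap(struct demo_file *fp, size_t len, int prot, int flags)
{
	(void)fp; (void)len; (void)prot; (void)flags;
	return (0);
}

static const struct demo_fileops demo_vnops = { .fo_mmap = demo_vn_mmap };
static const struct demo_fileops demo_pipeops = { .fo_mmap = NULL };
```

A file whose ops table leaves the hook unset gets ENODEV from the generic
wrapper, so the file type-specific logic lives entirely in the hook
implementations.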

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead. While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision: https://reviews.freebsd.org/D2658
Reviewed by: alc (glanced over), kib
MFC after: 1 month
Sponsored by: Chelsio


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 0538aafc 18-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to
(partially) disable the removed shims.

Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3d653db0 21-Mar-2015 Alan Cox <alc@FreeBSD.org>

Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks


# f81b73f3 28-Feb-2015 Alan Cox <alc@FreeBSD.org>

Eliminate a variable that became unused when VFS_LOCK_GIANT() was
eliminated.

MFC after: 3 days


# 01ca58b2 05-Dec-2014 John Baldwin <jhb@FreeBSD.org>

Always ignore the deprecated MAP_RENAME and MAP_NORESERVE flags to mmap().
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).

Differential Revision: https://reviews.freebsd.org/D1262
Reviewed by: kib


# 5817298f 17-Oct-2014 John Baldwin <jhb@FreeBSD.org>

Retire the unimplemented MAP_RENAME and MAP_NORESERVE flags to mmap(2).
Older binaries are still permitted to use these flags.

PR: 193961 (exp-run in ports)
Differential Revision: https://reviews.freebsd.org/D848
Reviewed by: kib


# 10204535 17-Sep-2014 Konstantin Belousov <kib@FreeBSD.org>

The vm_mmap_cdev() explicitly converts absence of both MAP_SHARED and
MAP_PRIVATE flags to MAP_SHARED. Apparently, some code in tree, in
particular, libgeom, relied on this behaviour, see r271721. For
regular file types, the absence of the flags is interpreted as
MAP_PRIVATE, and libc nlist used this (fixed in r271723).

Allow the implicit flags for legacy binaries. Bump __FreeBSD_version
to get the ABI note on new binaries to check for in mmap code.

Remove the test for presence of one of the MAP_ANON, MAP_SHARED or
MAP_PRIVATE flags before fget_mmap(). For MAP_ANON, we already verify
that passed fd == -1. For fd != -1, test after fget_mmap() (for newer
binaries) covers the case.

Reported by: bdrewery, pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation


# 8bafac54 16-Sep-2014 John Baldwin <jhb@FreeBSD.org>

Permit MAP_RENAME and MAP_NORESERVE for now. These flags should be removed, but at least
Chromium and OpenJDK use MAP_NORESERVE.


# 5fd3f8b3 15-Sep-2014 John Baldwin <jhb@FreeBSD.org>

Add stricter checking of some mmap() arguments:
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
mappings.
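
The four checks listed above can be sketched in user-space C. The set of
"known" flags below is a deliberately small assumption for illustration, not
the list the kernel accepts, and demo_check_mmap_args() is a hypothetical name:

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc in strict modes */
#include <sys/mman.h>

#include <assert.h>
#include <errno.h>

static int
demo_check_mmap_args(int prot, int flags)
{
	const int known_prot = PROT_READ | PROT_WRITE | PROT_EXEC;
	/* Assumed subset of flags; a real kernel knows many more. */
	const int known_flags = MAP_SHARED | MAP_PRIVATE | MAP_FIXED |
	    MAP_ANON;

	if ((prot & ~known_prot) != 0)
		return (EINVAL);	/* invalid protection mask */
	if ((flags & ~known_flags) != 0)
		return (EINVAL);	/* unknown flag */
	if ((flags & MAP_SHARED) != 0 && (flags & MAP_PRIVATE) != 0)
		return (EINVAL);	/* both sharing modes at once */
	if ((flags & MAP_ANON) == 0 &&
	    (flags & (MAP_SHARED | MAP_PRIVATE)) == 0)
		return (EINVAL);	/* file mapping without a mode */
	return (0);
}
```

Each test masks off what is understood and rejects any leftover bits, so a
garbage argument is reported rather than silently ignored.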

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D698


# e7d939bd 06-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 11c42bcc 18-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED,
and prevents the request from deleting existing mappings in the
region, failing instead.

Reviewed by: alc
Discussed with: jhb
Tested by: markj, pho (previous version, as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4648ba0a 08-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Make mmap(MAP_STACK) search for the available address space, similar
to !MAP_STACK mapping requests. For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.

For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().

The side effect of the change is that MAP_STACK started obeying
MAP_ALIGNMENT and MAP_32BIT flags.

Reported by: rwatson
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e103f5b1 07-May-2014 Peter Holm <pho@FreeBSD.org>

msync(2) must return ENOMEM and not EINVAL when the address is outside the
allowed range or when one or more pages are not mapped. This is according to
The Open Group Base Specifications Issue 7.

Discussed with: attilio, Bruce Evans
Reviewed by: alc, Garrett Cooper
Reported by: ATF
MFC after: 2 weeks
Sponsored by: EMC / Isilon storage division


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff, the struct pcpu cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 6a87d217 12-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Fix an off-by-one error when populating mincore(2) entries for
skipped entries. lastvecindex references the last valid byte,
so the new bytes should come after it.

Approved by: re (kib)
MFC after: 1 week


# edb572a3 09-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by: alc
Approved by: re (kib)


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
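
The bit layout described above (two size bits at the top, then five one-hot
index bits) can be sketched as follows. The DEMO_ names are hypothetical, and
the shift of 57 is simply 64 - 2 - 5 as derived from that description:

```c
#include <assert.h>
#include <stdint.h>

/* One-hot index bit at position 57 + idx, OR-ed with the right's bit. */
#define	DEMO_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define	DEMO_CAP_LOOKUP	DEMO_CAPRIGHT(0, 0x0000000000000400ULL)
#define	DEMO_CAP_FCHMOD	DEMO_CAPRIGHT(0, 0x0000000000002000ULL)
#define	DEMO_CAP_PDKILL	DEMO_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array index encoded in 'right', or -1 if the five index
 * bits do not contain exactly one set bit (for example, because rights
 * from different array elements were OR-ed together).
 */
static int
demo_right_index(uint64_t right)
{
	unsigned int idxbits = (unsigned int)((right >> 57) & 0x1f);
	int idx = 0;

	if (idxbits == 0 || (idxbits & (idxbits - 1)) != 0)
		return (-1);
	while ((idxbits >>= 1) != 0)
		idx++;
	return (idx);
}
```

An alias built from rights of the same element keeps a single index bit,
whereas mixing elements sets two index bits and is detectable.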

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
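
The termination trick can be sketched in isolation. This simplified,
hypothetical variant ORs plain 64-bit values together rather than operating on
a cap_rights_t, and it assumes every argument has type unsigned long long:

```c
#include <assert.h>
#include <stdarg.h>

/* Consume arguments until the 0ULL sentinel appended by the macro. */
static unsigned long long
demo_rights_or_impl(unsigned long long first, ...)
{
	va_list ap;
	unsigned long long acc = 0, r;

	va_start(ap, first);
	for (r = first; r != 0; r = va_arg(ap, unsigned long long))
		acc |= r;
	va_end(ap);
	return (acc);
}

/* Callers never write the terminator; the macro appends it. */
#define	demo_rights_or(...)	demo_rights_or_impl(__VA_ARGS__, 0ULL)
```

Because the sentinel is added by the macro, a caller cannot forget it, at the
cost that 0 itself can never be a valid right.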

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 5aa60b6f 16-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests for n >= number of bits in a pointer or less than the size of
a page fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.
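
The MAP_ALIGNED(n) validity rule and the resulting alignment arithmetic can be
sketched as below. Both helpers are hypothetical; page_shift is passed in
explicitly (12 for 4 KB pages) rather than taken from the system:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Reject n >= pointer width and alignments smaller than a page. */
static int
demo_check_align(unsigned int n, unsigned int page_shift)
{
	if (n >= sizeof(void *) * CHAR_BIT || n < page_shift)
		return (EINVAL);
	return (0);
}

/*
 * Round addr up to a (1 << n) boundary.  Assumes n < 64 and that
 * addr + mask does not overflow.
 */
static uint64_t
demo_align_up(uint64_t addr, unsigned int n)
{
	uint64_t mask = (UINT64_C(1) << n) - 1;

	return ((addr + mask) & ~mask);
}
```

For example, n = 21 requests 2 MB alignment, so an address of 0x201000 would
be rounded up to 0x400000.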

Reviewed by: alc
MFC after: 1 month


# fc2b1679 22-Jul-2013 Jeremie Le Hen <jlh@FreeBSD.org>

Fix previous commit when option RACCT is not used.

MFC after: 7 days


# c92b5069 22-Jul-2013 Jeremie Le Hen <jlh@FreeBSD.org>

Fix a panic in the racct code when munlock(2) is called with incorrect values.

The racct code in sys_munlock() assumed that the boundaries provided by the
userland were correct as long as vm_map_unwire() returned successfully.
However the latter contains its own logic and sometimes manages to do something
out of those boundaries, even if they are buggy. This change makes the racct
code use the accounting done by the vm layer, as is done in other places
such as vm_mlock().

Although this fixes the panic, Alan Cox pointed out that this code is still
racy: two simultaneous callers will produce incorrect values.

Reviewed by: alc
MFC after: 7 days


# ff74a3fa 19-Jul-2013 John Baldwin <jhb@FreeBSD.org>

Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
vm_map_find() that will try to alter the alignment of a mapping to match
any existing superpage mappings of the object being mapped. If no
suitable address range is found with the necessary alignment,
vm_map_find() will fall back to using the simple first-fit strategy
(VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by: alc (earlier version)
MFC after: 2 weeks


# 995d7069 08-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make sys_mlock() function just a wrapper around vm_mlock() function
that does all the job.

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


# 53f5f8a0 02-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Add a hint suggesting why tmpfs does not need a special case there.


# e5f299ff 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode's
v_object of non OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


# bafa6cfc 28-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Release the v_writecount reference on the vnode in case of error,
before the vnode is vput() in vm_mmap_vnode(). Error return means
that there is no use reference on the vnode from the vm object
reference, and failing to restore v_writecount breaks the invariant
that v_writecount is less or equal to the usecount.

The situation observed when nfs client returns ESTALE for
VOP_GETATTR() after the open.

In collaboration with: pho
MFC after: 1 week


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes follow:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrieved with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrieve
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNOD to CAP_MKNODAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convenient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
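
The CAP_MMAP_* composition listed above can be read as a mapping from
mmap(2) protection bits to the rights a descriptor must hold. The
following is a minimal userland sketch of that reading, not the kernel
code: the EX_ names, the helper functions, and the bit values chosen for
read, write, seek and mmap are placeholders assumed for illustration.
Only the 0x8 bit in CAP_MMAP_X is taken from the defines above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>	/* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

/* Placeholder bit values, assumed for this sketch only. */
#define	EX_CAP_READ	0x01ULL
#define	EX_CAP_WRITE	0x02ULL
#define	EX_CAP_SEEK	0x04ULL
#define	EX_CAP_EXEC_BIT	0x08ULL	/* literal bit used by CAP_MMAP_X above */
#define	EX_CAP_MMAP	0x10ULL

/* Composition mirrors the defines quoted in the commit message. */
#define	EX_CAP_MMAP_R	(EX_CAP_MMAP | EX_CAP_SEEK | EX_CAP_READ)
#define	EX_CAP_MMAP_W	(EX_CAP_MMAP | EX_CAP_SEEK | EX_CAP_WRITE)
#define	EX_CAP_MMAP_X	(EX_CAP_MMAP | EX_CAP_SEEK | EX_CAP_EXEC_BIT)

/* Hypothetical helper: rights needed to map a descriptor with 'prot'. */
static uint64_t
ex_mmap_rights(int prot)
{
	uint64_t need = EX_CAP_MMAP;	/* PROT_NONE needs CAP_MMAP alone */

	/* Only the three standard protection bits are understood here. */
	assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);
	if (prot & PROT_READ)
		need |= EX_CAP_MMAP_R;
	if (prot & PROT_WRITE)
		need |= EX_CAP_MMAP_W;
	if (prot & PROT_EXEC)
		need |= EX_CAP_MMAP_X;
	return (need);
}

/* Hypothetical check: does 'have' contain every right that 'prot' needs? */
static int
ex_mmap_allowed(uint64_t have, int prot)
{
	uint64_t need = ex_mmap_rights(prot);

	return ((have & need) == need);
}
```

With these placeholders, a descriptor limited to EX_CAP_MMAP_R can be
mapped PROT_READ but not PROT_READ | PROT_EXEC.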

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 3ac7d297 09-Jan-2013 Andrey Zonov <zont@FreeBSD.org>

- Reduce kernel size by removing unnecessary pointer indirections.

GENERIC kernel size reduced by 16 bytes and RACCT kernel by 336 bytes.

Suggested by: alc
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 week


# 7e19eda4 18-Dec-2012 Andrey Zonov <zont@FreeBSD.org>

- Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

Reviewed by: avg, trasz
Approved by: kib (mentor)
MFC after: 1 week


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# c4e357e8 05-Sep-2012 Andrey Zonov <zont@FreeBSD.org>

- Simplify VM code by using vmspace_wired_count() for counting wired
memory of a process.

Reviewed by: avg
Approved by: kib (mentor)
MFC after: 2 weeks


# ee4116b8 13-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

For the old mmap syscall, when executing on amd64 or ia64, enforce
PROT_EXEC if prot is non-zero, the process is 32bit and the
kern.elf32.i386_read_exec sysctl is enabled. This workaround is needed
for old i386 a.out binaries, where the dynamic linker did not specify
PROT_EXEC for the mapping of the text.

The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries,
but I reused the existing knob which already has the needed semantic.
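
The rule described above can be sketched as a small pure function. This
is an illustration only: the function name and its two boolean
parameters are hypothetical stand-ins for "the process is 32-bit" and
"kern.elf32.i386_read_exec is enabled", and the real check lives in the
old mmap syscall path.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>	/* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

/*
 * Hypothetical helper: add PROT_EXEC for a 32-bit process when the
 * read_exec knob is on and the caller asked for any access at all.
 */
static int
ex_ommap_fixup_prot(int prot, bool proc_is_32bit, bool read_exec_knob)
{
	/* Only the three standard protection bits are understood here. */
	assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);
	if (proc_is_32bit && read_exec_knob && prot != PROT_NONE)
		prot |= PROT_EXEC;
	return (prot);
}
```

A PROT_NONE request is left untouched, so an inaccessible mapping stays
inaccessible even with the knob enabled.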

MFC after: 1 week


# 7707ccab 14-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Adjust the r205536, by allowing a non-zero offset for anonymous
mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.

Reported and tested by: Dan Plassche <dplassche@gmail.com>
MFC after: 1 week


# 1472f4f4 21-Apr-2012 Konstantin Belousov <kib@FreeBSD.org>

When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.

Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.

Reported and tested by: Sushanth Rai <sushanth_rai yahoo com>
Reviewed by: alc
MFC after: 1 week


# 1c8279e4 08-Apr-2012 Alan Cox <alc@FreeBSD.org>

Fix mincore(2) so that it reports PG_CACHED pages as resident.

MFC after: 2 weeks


# 126d6082 17-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propagate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


# 83cbe16f 02-Mar-2012 Alan Cox <alc@FreeBSD.org>

Eliminate stale incorrect ARGSUSED comments.

Submitted by: bde


# f9230ad6 25-Feb-2012 Alan Cox <alc@FreeBSD.org>

Simplify vm_mmap()'s control flow.

Add a comment describing what vm_mmap_to_errno() does.

Reviewed by: kib
MFC after: 3 weeks
X-MFC after: r232071


# 9d22083d 24-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Place the if() at the right location, to activate the v_writecount
accounting for shared writeable mappings for all filesystems, not only
for the bypass layers.

Submitted by: alc
Pointy hat to: kib
MFC after: 20 days


# 84110e7e 23-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Account the writeable shared mappings backed by file in the vnode
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBSY, as it
should be.
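
The accounting described above amounts to a byte counter on the object
that drives a reference count on the vnode at its zero crossings. A toy
sketch, assuming simplified stand-in structures: the field names
v_writecount and writemappings come from the commit message, while the
struct layouts and function names are invented for the example and no
locking is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the vnode and the vm object that it backs. */
struct ex_vnode {
	int	v_writecount;
};

struct ex_object {
	struct ex_vnode	*vp;
	size_t		writemappings;	/* bytes of writeable shared mappings */
};

/* Account 'len' bytes of a new writeable shared mapping. */
static void
ex_writemap_add(struct ex_object *obj, size_t len)
{
	if (len == 0)
		return;
	if (obj->writemappings == 0)
		obj->vp->v_writecount++;	/* counter leaves zero */
	obj->writemappings += len;
}

/* Drop 'len' bytes when a map entry is finally deallocated. */
static void
ex_writemap_release(struct ex_object *obj, size_t len)
{
	assert(len <= obj->writemappings);
	if (len == 0)
		return;
	obj->writemappings -= len;
	if (obj->writemappings == 0)
		obj->vp->v_writecount--;	/* counter returns to zero */
}
```

However many mappings exist, the vnode sees a single write reference,
which is what keeps the mount from being demoted to read-only while any
of them remain.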

Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks


# a6492969 15-Feb-2012 Alan Cox <alc@FreeBSD.org>

When vm_mmap() is used to map a vm object into a kernel vm_map, it
makes no sense to check the size of the kernel vm_map against the
user-level resource limits for the calling process.

Reviewed by: kib


# 8211bd45 11-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.
Another thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.

Noted and reviewed by: alc
MFC after: 1 week


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal to kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 2e32165c 10-Jul-2011 Konstantin Belousov <kib@FreeBSD.org>

Extract the code to translate VM error into errno, into an exported
function vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).
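
A translation function of this kind is essentially a switch from VM
return codes to errno values. The sketch below shows the shape only:
the enum is a local stand-in for the kernel's KERN_* codes, and the
particular mapping (address and space failures to ENOMEM, protection
failure to EACCES, anything else to EINVAL) is an assumption that should
be checked against vm_mmap_to_errno() itself.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the VM return codes; values assumed for the sketch. */
enum ex_kern_rv {
	EX_KERN_SUCCESS = 0,
	EX_KERN_INVALID_ADDRESS,
	EX_KERN_PROTECTION_FAILURE,
	EX_KERN_NO_SPACE,
	EX_KERN_FAILURE
};

/* Hypothetical translation of a VM return code into an errno value. */
static int
ex_vm_to_errno(enum ex_kern_rv rv)
{
	switch (rv) {
	case EX_KERN_SUCCESS:
		return (0);
	case EX_KERN_INVALID_ADDRESS:
	case EX_KERN_NO_SPACE:
		return (ENOMEM);
	case EX_KERN_PROTECTION_FAILURE:
		return (EACCES);
	default:
		return (EINVAL);
	}
}
```

A driver offering mmap(2)-like functionality would call such a function
on the result of its VM operation so that userland sees the same error
codes as from mmap(2).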

Sponsored by: The FreeBSD Foundation
No objections from: alc
MFC after: 1 week


# 3103730c 10-Jul-2011 Konstantin Belousov <kib@FreeBSD.org>

Style.

MFC after: 3 days


# afcc55f3 06-Jul-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 1ba5ad42 05-Apr-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add accounting for most of the memory-related resources.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 7ec9c8d1 24-Feb-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.
As Alan Cox pointed out, it no longer serves its purpose with
the modern UMA allocator compared to the old one used in 4.x days.

The removal of the sysctl eliminates the max_proc_mmap type overflow leading
to the broken mmap(2) seen with large amounts of physical memory on arches
with effectively unbounded KVA space (such as amd64). It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.

Reviewed by: alc, kib
Approved by: avg (mentor)
MFC after: 2 months


# a2f510e8 04-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix comment indentation.


# 7022f954 14-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by: swell.k gmail com
MFC after: 1 week


# 94bce453 14-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Use symbolic names instead of hardcoding values for magic p_osrel constants.

MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# da048309 19-Sep-2010 Alan Cox <alc@FreeBSD.org>

Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8. This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR: 150260
MFC after: 3 weeks


# d473d3a1 06-Sep-2010 Ryan Stone <rstone@FreeBSD.org>

Fix a typo in r212281. uintptr -> uintptr_t

Pointy hat to: rstone

Approved by: emaste (mentor)
MFC after: 2 weeks


# 0d419640 06-Sep-2010 Ryan Stone <rstone@FreeBSD.org>

In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock. This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by: fabient
Approved by: emaste (mentor)
MFC after: 2 weeks


# 74ffb9af 28-Aug-2010 Alan Cox <alc@FreeBSD.org>

Add the MAP_PREFAULT_READ option to mmap(2).

Reviewed by: jhb, kib


# 3979450b 06-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that the created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


# fd6f4ffb 27-Jul-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.


# c46b90e9 26-May-2010 Alan Cox <alc@FreeBSD.org>

Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by: kib


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 7b85f591 24-Apr-2010 Alan Cox <alc@FreeBSD.org>

Resurrect pmap_is_referenced() and use it in mincore(). Essentially,
pmap_ts_referenced() is not always appropriate for checking whether or
not pages have been referenced because it clears any reference bits
that it encounters. For example, in mincore(), clearing the reference
bits has two negative consequences. First, it throws off the activity
count calculations performed by the page daemon. Specifically, a page
on which mincore() has called pmap_ts_referenced() looks less active
to the page daemon than it should. Consequently, the page could be
deactivated prematurely by the page daemon. Arguably, this problem
could be fixed by having mincore() duplicate the activity count
calculation on the page. However, there is a second problem for which
that is not a solution. In order to clear a reference on a 4KB page,
it may be necessary to demote a 2/4MB page mapping. Thus, a mincore()
by one process can have the side effect of demoting a superpage
mapping within another process!
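
The distinction drawn above is between a test-and-clear operation and a
pure query. A toy model of one reference bit makes the side effect
visible; the struct and function names are invented for the example and
stand in for the behavior attributed to pmap_ts_referenced() and
pmap_is_referenced(), not for their real implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the reference bit of one page mapping. */
struct ex_page {
	bool	referenced;
};

/* Test-and-clear: reports the bit and resets it, changing later answers. */
static bool
ex_ts_referenced(struct ex_page *p)
{
	bool was;

	assert(p != NULL);
	was = p->referenced;
	p->referenced = false;
	return (was);
}

/* Pure query: reports the bit and leaves it for the page daemon to see. */
static bool
ex_is_referenced(const struct ex_page *p)
{
	assert(p != NULL);
	return (p->referenced);
}
```

Calling the query any number of times leaves the bit set, while a single
test-and-clear call makes the page look unreferenced to every later
observer.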


# f70ad548 14-Apr-2010 John Baldwin <jhb@FreeBSD.org>

MFC 205536:
Reject attempts to create a MAP_ANON mapping with a non-zero offset.


# 5711bf30 23-Mar-2010 John Baldwin <jhb@FreeBSD.org>

Reject attempts to create a MAP_ANON mapping with a non-zero offset.

PR: kern/71258
Submitted by: Alexander Best
MFC after: 2 weeks


# fc508327 02-Oct-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Back out the functional parts from r197537. After r197711, affecting all
user mappings, mmap no longer needs special treatment.


# 27bfa958 27-Sep-2009 Simon L. B. Nielsen <simon@FreeBSD.org>

Do not allow mmap with the MAP_FIXED argument to map at address zero.
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities. While this of course does not fix vulnerabilities,
it does mitigate their impact.

Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.

This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.
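
The check described above can be sketched as a predicate on the request.
This is an illustration under assumptions: the function name is
hypothetical, the boolean parameter stands in for the
security.bsd.mmap_zero setting, and the EINVAL return value is a guess
rather than something stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>	/* MAP_FIXED */

/*
 * Hypothetical check: refuse a MAP_FIXED request at address zero unless
 * the administrator has allowed it. A zero address without MAP_FIXED is
 * only a hint and is not affected.
 */
static int
ex_check_fixed_zero(uintptr_t addr, int flags, bool mmap_zero_allowed)
{
	if ((flags & MAP_FIXED) != 0 && addr == 0 && !mmap_zero_allowed)
		return (EINVAL);	/* assumed error code */
	return (0);
}
```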

Discussed with: rwatson, bz
Tested by: bz (Wine), simon (VirtualBox)
Submitted by: jhb


# 2c5f9fbe 23-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197348:
For a.out and pre-8 ELF binaries, allow the mmap of zero length.

Approved by: re (kensmith)


# 497a8238 19-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

Old (a.out) rtld attempts to mmap zero-length region, e.g. when bss
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.

For a.out and pre-8 ELF binaries, allow the mmap of zero length.
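
Combined with the POSIX rule that a zero-length mmap() fails with
EINVAL, the exception above gives a two-input check. A minimal sketch,
assuming a hypothetical function name and a boolean that stands in for
"the binary is a.out or a pre-8 ELF image"; how the kernel actually
identifies such binaries is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>	/* SIZE_MAX, used when exercising the check */

/*
 * Hypothetical length check: a zero length is an error for current
 * binaries but tolerated for legacy ones. No upper bound is applied,
 * since the length is an unsigned size_t.
 */
static int
ex_mmap_check_len(size_t len, bool legacy_binary)
{
	if (len == 0 && !legacy_binary)
		return (EINVAL);
	return (0);
}
```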

Reported by: tegge
Reviewed by: tegge, alc, jhb
MFC after: 3 days


# 0fe0ed8b 14-Jul-2009 John Baldwin <jhb@FreeBSD.org>

- Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.

Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week


# 3364c323 23-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation is performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 64345f0b 01-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping VCHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.
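
The dispatch described in the first item above can be modelled in
userland with a function pointer and an ENODEV fallback. Everything in
this sketch is a simplified stand-in: the EX_/ex_ types, the reduced
callback signature (no cdev argument), and the example driver that
treats offset 0x1000 as a cookie are all invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct ex_object {
	const char	*name;
};

/* Reduced callback shape, loosely modelled on d_mmap_single. */
typedef int ex_mmap_single_t(int64_t *foff, size_t size,
    struct ex_object **objp, int prot);

struct ex_cdevsw {
	ex_mmap_single_t	*d_mmap_single;
};

static struct ex_object ex_device_pager_obj = { "device pager" };
static struct ex_object ex_buf_obj = { "driver buffer" };

/* Try the driver callback first; ENODEV selects the device pager path. */
static int
ex_map_cdev(const struct ex_cdevsw *sw, int64_t *foff, size_t size,
    int prot, struct ex_object **objp)
{
	int error = ENODEV;

	assert(sw != NULL && foff != NULL && objp != NULL);
	if (sw->d_mmap_single != NULL)
		error = sw->d_mmap_single(foff, size, objp, prot);
	if (error == ENODEV) {
		*objp = &ex_device_pager_obj;	/* fallback: pager + d_mmap() */
		error = 0;
	}
	return (error);
}

/* Example driver: the file offset is a cookie naming one object. */
static int
ex_drv_mmap_single(int64_t *foff, size_t size, struct ex_object **objp,
    int prot)
{
	(void)size;
	(void)prot;
	if (*foff != 0x1000)
		return (ENODEV);	/* unknown cookie: use the fallback */
	*objp = &ex_buf_obj;
	*foff = 0;			/* map from the start of the object */
	return (0);
}
```

A request at the cookie offset yields the driver's own object with the
offset rewritten to zero, while any other offset falls through to the
device pager unchanged.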

MFC after: 1 month


# beb3c3a9 04-Apr-2009 Alan Cox <alc@FreeBSD.org>

Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission. On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission. The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list. So,
the need for coalescing vm map entries is not as great as it once was.


# 655c3490 24-Feb-2009 Konstantin Belousov <kib@FreeBSD.org>

Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by: pho
Reviewed by: alc


# 897d81a0 08-Feb-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by: tegge, alc
Tested by: pho


# fa3de770 21-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by: ups


# 556c3162 22-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after: 3 days


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 36b90789 20-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with: rwatson, jhb, rnoland
Tested by: rnoland
MFC after: 3 days


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and thus served no purpose.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 6bd9cb1c 03-Aug-2008 Tom Rhodes <trhodes@FreeBSD.org>

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


# ba304211 24-May-2008 Alan Cox <alc@FreeBSD.org>

To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped. Specifically, it has
returned EINVAL if the entire range is not mapped. There is not,
however, any basis for this in either SUSv2 or our own man page.
Moreover, neither Linux nor Solaris imposes this requirement. This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks


# d0a83a83 17-May-2008 Alan Cox <alc@FreeBSD.org>

In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping. Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address. Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address. If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated. In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().


# b8ca4ef2 27-Apr-2008 Alan Cox <alc@FreeBSD.org>

vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.


# 91a35e78 20-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.

PR: kern/119422
MFC after: 1 week


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 8e38aeff 08-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors is garbage
collected when it is not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
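
The pathname hashing mentioned above can be sketched as follows. This is a minimal illustration of a 32-bit FNV-1 hash with the commonly published offset basis and prime; the constants the kernel actually uses may differ, and shm_bucket() with its table size parameter is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Commonly published 32-bit FNV parameters (assumed, not from the commit). */
#define SKETCH_FNV32_OFFSET_BASIS 2166136261u
#define SKETCH_FNV32_PRIME        16777619u

/* 32-bit FNV-1: multiply by the prime, then XOR in the next byte. */
static uint32_t
fnv1_32_str(const char *s)
{
	uint32_t h = SKETCH_FNV32_OFFSET_BASIS;

	while (*s != '\0') {
		h *= SKETCH_FNV32_PRIME;
		h ^= (uint8_t)*s++;
	}
	return (h);
}

/* Hypothetical helper: pick a hash-table bucket for a shm pathname. */
static unsigned
shm_bucket(const char *path, unsigned nbuckets)
{
	return (fnv1_32_str(path) % nbuckets);
}
```

For example, shm_bucket("/myshm", 64) always yields the same bucket index below 64 for that path, which is what lets lookups by pathname find an existing object.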


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# c899450b 18-Oct-2007 Peter Wemm <peter@FreeBSD.org>

Fix cosmetic bug in stale copy of msync_args. 'len' is size_t, not int.


# d239bd3c 19-Aug-2007 Konstantin Belousov <kib@FreeBSD.org>

Do not drop vm_map lock between doing vm_map_remove() and vm_map_insert().
For this, introduce vm_map_fixed() that does that for MAP_FIXED case.

Dropping the lock allowed for parallel thread to occupy the freed space.

Reported by: Tijl Coosemans <tijl ulyssis org>
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# c2815ad5 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate

Approved by: re (kensmith)


# 6bda842d 16-Jun-2007 Matt Jacob <mjacob@FreeBSD.org>

Make sure object is NULL- there is a possible case where you could
fall through to it being used w/o being set. Put a break in the default
case.


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 455dd7d4 20-Jun-2006 Konstantin Belousov <kib@FreeBSD.org>

Make the mincore(2) return ENOMEM when requested range is not fully mapped.

Requested by: Bruno Haible <bruno at clisp org>
Reviewed by: alc
Approved by: pjd (mentor)
MFC after: 1 month


# 89eae00b 21-Apr-2006 Tom Rhodes <trhodes@FreeBSD.org>

It seems that POSIX would rather ENODEV returned in place of EINVAL when
trying to mmap() an fd that isn't a normal file.

Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Submitted by: fanf
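
The error code in question can be observed from userland with a sketch like the one below. It assumes a pipe is a suitable example of a descriptor that is not a normal file; mmap_pipe_errno() is a hypothetical helper written for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to mmap() the read end of a pipe and return
 * the resulting errno, 0 if the mapping unexpectedly succeeded, or -1
 * if the pipe could not be created.
 */
static int
mmap_pipe_errno(void)
{
	int fds[2];
	int err = 0;
	void *p;

	if (pipe(fds) != 0)
		return (-1);
	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fds[0], 0);
	if (p == MAP_FAILED)
		err = errno;
	else
		(void)munmap(p, 4096);
	(void)close(fds[0]);
	(void)close(fds[1]);
	return (err);
}
```

On a system that follows the POSIX wording referenced above, the helper is expected to return ENODEV rather than EINVAL.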


# 49874f6e 25-Mar-2006 Joseph Koshy <jkoshy@FreeBSD.org>

MFP4: Support for profiling dynamically loaded objects.

Kernel changes:

Inform hwpmc of executable objects brought into the system by
kldload() and mmap(), and of their removal by kldunload() and
munmap(). A helper function linker_hwpmc_list_objects() has been
added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
the list of currently loaded kernel modules.

The unused `MAPPINGCHANGE' event has been deprecated in favour
of separate `MAP_IN' and `MAP_OUT' events; this change reduces
space wastage in the log.

Bump the hwpmc's ABI version to "2.0.00". Teach hwpmc(4) to
handle the map change callbacks.

Change the default per-cpu sample buffer size to hold
32 samples (up from 16).

Increment __FreeBSD_version.

libpmc(3) changes:

Update libpmc(3) to deal with the new events in the log file; bring
the pmclog(3) manual page in sync with the code.

pmcstat(8) changes:

Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
(mapfile name), "-q"/"-v" (verbosity control). Option "-k" now
takes a kernel directory as its argument but will also work with
the older invocation syntax.

Rework string handling in pmcstat(8) to use an opaque type for
interned strings. Clean up ELF parsing code and add support for
tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).

Report statistics at the end of a log conversion run depending
on the requested verbosity level.

Reviewed by: jhb, dds (kernel parts of an earlier patch)
Tested by: gallatin (earlier patch)


# 9f5c1d19 12-Oct-2005 Diomidis Spinellis <dds@FreeBSD.org>

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 1e309003 04-Oct-2005 Diomidis Spinellis <dds@FreeBSD.org>

Update the vnode's access time after an mmap operation on it.
Before this change a copy operation with cp(1) would not update the
file access times.

According to the POSIX mmap(2) documentation: the st_atime field
of the mapped file may be marked for update at any time between the
mmap() call and the corresponding munmap() call. The initial read
or write reference to a mapped region shall cause the file's st_atime
field to be marked for update if it has not already been marked for
update.


# 749474f2 20-Sep-2005 Peter Wemm <peter@FreeBSD.org>

Remove unused (but initialized) variable 'objsize' from vm_mmap()


# c92163dc 14-Apr-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Move MAC check_vnode_mmap entry point out from being exclusive to
MAP_SHARED so that the entry point gets executed un-conditionally.
This may be useful for security policies which want to perform access
control checks around run-time linking.

-add the mmap(2) flags argument to the check_vnode_mmap entry point
so that we can make access control decisions based on the type of
mapped object.
-update any dependent API around this parameter addition such as
function prototype modifications, entry point parameter additions
and the inclusion of sys/mman.h header file.
-Change the MLS, BIBA and LOMAC security policies so that subject
domination routines are not executed unless the type of mapping is
shared. This is done to maintain compatibility between the old
vm_mmap_vnode(9) and these policies.

Reviewed by: rwatson
MFC after: 1 month


# 98df9218 01-Apr-2005 John Baldwin <jhb@FreeBSD.org>

- Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmaping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.

Reviewed by: alc, arch@


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# ae51ff11 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# ff4782b5 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Don't clear flags we just checked were not set.


# 891822a8 24-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

XXX mark two places where we do not hold a threadcount on the dev when
frobbing the cdevsw.

In both cases we examine only the cdevsw and it is a good question if we
weren't better off copying those properties into the cdev in the first
place. This question will be revisited.


# c2296e99 01-Sep-2004 Alan Cox <alc@FreeBSD.org>

Remove dead code.


# 23fc1a90 05-Aug-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a product specific workaround for wrong modes when mmap(2)'ing
devices. They have had plenty of time to adjust now.


# 21c12545 01-Aug-2004 Alan Cox <alc@FreeBSD.org>

Eliminate the acquisition and release of Giant around the call to
pmap_mincore() in mincore(2). Either pmap locking exists (alpha, amd64,
i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).


# 1930e303 11-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.


# 8eec77b0 11-May-2004 Tim J. Robbins <tjr@FreeBSD.org>

To handle orphaned character device vnodes properly in mmap(), check that
v_mount is non-null before dereferencing it. If it's null, behave as if
MNT_NOEXEC was not set on the mount that originally contained it.


# 05eb3785 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# ce7a036d 04-Apr-2004 Alexander Kabaev <kan@FreeBSD.org>

Delay permission checks for VCHR vnodes until after vnode is locked in
vm_mmap_vnode function, where we can safely check for a special /dev/zero
case. Rev. 1.180 has reordered checks and introduced a regression.

Submitted by: alc
Was broken by: kan


# b483c7f6 18-Mar-2004 Guido van Rooij <guido@FreeBSD.org>

When mmap-ing a file from a noexec mount, be sure not to grant the right
to mmap it PROT_EXEC. This also depends on the architecture, as some
architectures (e.g. i386) do not distinguish between read and exec pages.

Inspired by: http://linux.bkbits.net:8080/linux-2.4/cset@1.1267.1.85
Reviewed by: alc


# bb734798 15-Mar-2004 Don Lewis <truckman@FreeBSD.org>

Make overflow/wraparound checking more robust and unbreak len=0 in
vslock(), mlock(), and munlock().

Reviewed by: bde
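
A wraparound check of the kind described above might look like the following sketch. It is not the committed code; the 4 KB page size and the name wire_range() are assumptions. It rejects a range whose end wraps past the top of the address space, either in addr + len or when rounding up to a page boundary, while still accepting len = 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: compute the page-aligned range [*startp, *endp)
 * covering [addr, addr + len). Returns false if the range wraps.
 */
static bool
wire_range(uintptr_t addr, uintptr_t len, uintptr_t *startp,
    uintptr_t *endp)
{
	uintptr_t last = addr + len;
	uintptr_t end;

	if (last < addr)
		return (false);		/* addr + len wrapped */
	end = (last + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
	if (end < last)
		return (false);		/* rounding up wrapped */
	*startp = addr & ~SKETCH_PAGE_MASK;
	*endp = end;
	return (true);
}
```

Checking only addr + len would miss the case where the sum is valid but lies in the last page, so that rounding up wraps to zero.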


# f0ea4612 14-Mar-2004 Don Lewis <truckman@FreeBSD.org>

Style(9) changes.

Pointed out by: bde


# be4c5ad0 14-Mar-2004 Don Lewis <truckman@FreeBSD.org>

Remove redundant suser() check.


# 16929939 05-Mar-2004 Don Lewis <truckman@FreeBSD.org>

Undo the merger of mlock()/vslock and munlock()/vsunlock() and the
introduction of kern_mlock() and kern_munlock() in
src/sys/kern/kern_sysctl.c 1.150
src/sys/vm/vm_extern.h 1.69
src/sys/vm/vm_glue.c 1.190
src/sys/vm/vm_mmap.c 1.179
because different resource limits are appropriate for transient and
"permanent" page wiring requests.

Retain the kern_mlock() and kern_munlock() API in the revived
vslock() and vsunlock() functions.

Combine the best parts of each of the original sets of implementations
with further code cleanup. Make the mlock() and vslock()
implementations as similar as possible.

Retain the RLIMIT_MEMLOCK check in mlock(). Move the most stringent
test, which can return EAGAIN, last so that requests that have no
hope of ever being satisfied will not be retried unnecessarily.

Disable the test that can return EAGAIN in the vslock() implementation
because it will cause the sysctl code to wedge.

Tested by: Cy Schubert <Cy.Schubert AT komquats.com>


# 30d4dd7e 29-Feb-2004 Alexander Kabaev <kan@FreeBSD.org>

Pick up a do {} while(0) cleanup by phk that was discarded accidentally in
previous revision.

Submitted by: alc


# c8daea13 27-Feb-2004 Alexander Kabaev <kan@FreeBSD.org>

Move the code dealing with vnode out of several functions into a single
helper function vm_mmap_vnode.

Discussed with: jeffr,alc (a while ago)


# 47934cef 25-Feb-2004 Don Lewis <truckman@FreeBSD.org>

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that it is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't use the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# cafe836a 20-Dec-2003 Alan Cox <alc@FreeBSD.org>

- Correct an error in mincore(2) that has existed since its introduction:
mincore(2) should check that the page is valid, not just allocated.
Otherwise, it can return a false positive for a page that is not yet
resident because it is being read from disk.


# 5e6dbda0 07-Dec-2003 Alexander Kabaev <kan@FreeBSD.org>

Remove trailing whitespace.


# c8123cb8 07-Dec-2003 Alan Cox <alc@FreeBSD.org>

Addendum to revision 1.174: In the case where vm_pager_allocate() is called
to create a vnode-backed object, the vnode lock must be held by the caller.

Reported by: truckman
Discussed with: kan


# 20eec4bb 05-Dec-2003 Alan Cox <alc@FreeBSD.org>

Fix a deadlock between vm_fault() and vm_mmap(): The expected lock ordering
between vm_map and vnode locks is that vm_map locks are acquired first. In
revision 1.150 mmap(2) was changed to pass a locked vnode into vm_mmap().
This creates a lock-order reversal when vm_mmap() calls one of the vm_map
routines that acquires a vm_map lock. The solution implemented herein is
to release the vnode lock in mmap() before calling vm_mmap() and reacquire
this lock if necessary in vm_mmap().

Approved by: re (scottl)
Reviewed by: jeff, kan, rwatson


# 6f8b4fc0 14-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Remove long dead code.


# b7b7cd44 13-Nov-2003 Alan Cox <alc@FreeBSD.org>

Changes to msync(2)
- Return EBUSY if the region was wired by mlock(2) and MS_INVALIDATE
is specified to msync(2). This is required by the Open Group Base
Specifications Issue 6.
- vm_map_sync() doesn't return KERN_FAILURE. Thus, msync(2) can't
possibly return EIO.
- The second major loop in vm_map_sync() handles sub maps. Thus,
failing on sub maps in the first major loop isn't necessary.


# d8834602 09-Nov-2003 Alan Cox <alc@FreeBSD.org>

- The Open Group Base Specifications Issue 6 specifies that an munmap(2)
must return EINVAL if size is zero. Submitted by: tegge
- In order to avoid a race condition in multithreaded applications, the
check and removal operations by munmap(2) must be in the same critical
section. To accommodate this, vm_map_check_protection() is modified to
require its caller to obtain at least a read lock on the map.


# 637315ed 09-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Remove Giant from msync(2). Giant is still acquired by the lower layers
if we drop into the pmap or vnode layers.
- Migrate the handling of zero-length msync(2)s into vm_map_sync() so that
multithread applications can't change the map between implementing the
zero-length hack in msync(2) and reacquiring the map lock in
vm_map_sync().

Reviewed by: tegge


# 950f8459 08-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same time, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


# 11f7ddc5 05-Oct-2003 Bruce M Simpson <bms@FreeBSD.org>

Only the super-user should be able to wire pages via the mlock() family
of system calls at this time. Remove various #ifdef's to enforce this.


# fd75d710 27-Sep-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. We need to include or cover the faulting
address.

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64


# c460ac3a 24-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 hierarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.
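
The placement rule for mmap(0, ...) can be sketched as below. The sketch assumes the default hint sits just past the highest address the data segment could reach under the process's RLIMIT_DATA hard limit, rounded up to a page; the function name, parameters, and 4 KB page size are hypothetical.

```c
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define HINT_PAGE_SIZE ((uintptr_t)4096)
#define HINT_PAGE_MASK (HINT_PAGE_SIZE - 1)

/*
 * Hypothetical helper: default address hint for mmap(0, ...), placed
 * above the region reserved for data segment growth.
 */
static uintptr_t
default_mmap_hint(uintptr_t data_start, uintptr_t data_rlim_max)
{
	uintptr_t top = data_start + data_rlim_max;

	return ((top + HINT_PAGE_MASK) & ~HINT_PAGE_MASK);
}
```

With a reduced limit the hint moves down accordingly, which is how mappings end up at sensible locations for a 32-bit process on a 64-bit system.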


# 7ebcee37 07-Sep-2003 Alan Cox <alc@FreeBSD.org>

Revise the locking in mincore(2).


# abd498aa 11-Aug-2003 Bruce M Simpson <bms@FreeBSD.org>

Add the mlockall() and munlockall() system calls.
- All those diffs to syscalls.master for each architecture *are*
necessary. This needed clarification; the stub code generation for
mlockall() was disabled, which would prevent applications from
linking to this API (suggested by mux)
- Giant has been quashed. It is no longer held by the code, as
the required locking has been pushed down within vm_map.c.
- Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
to express their intention explicitly.
- Inspected at the vmstat, top and vm pager sysctl stats level.
Paging-in activity is occurring correctly, using a test harness.
- The RES size for a process may appear to be greater than its SIZE.
This is believed to be due to mappings of the same shared library
page being wired twice. Further exploration is needed.
- Believed to back out of allocations and locks correctly
(tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).

PR: kern/43426, standards/54223
Reviewed by: jake, alc
Approved by: jake (mentor)
MFC after: 2 weeks


# a5d841d4 03-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unnecessary cast.


# 3b6d9652 22-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a separate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# a6af4ff1 21-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Use a do {...} while (0); and a couple of breaks to reduce the level
of indentation a bit.


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# bc5b057f 09-Jun-2003 Alan Cox <alc@FreeBSD.org>

Hold the vm object's lock when performing vm_page_lookup().


# 69297bf8 17-Apr-2003 John Baldwin <jhb@FreeBSD.org>

suser() does not need the proc lock, just the setting of P_PROTECTED in
p_flag needs the lock.


# f4cf2141 31-Mar-2003 Wes Peters <wes@FreeBSD.org>

Add a facility allowing processes to inform the VM subsystem they are
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.

Reviewed by: arch@
Sponsored by: St. Bernard Software


# 6900a17c 29-Mar-2003 Maxime Henrion <mux@FreeBSD.org>

The object type can't be OBJT_PHYS in vm_mmap().

Reviewed by: peter


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# e80b7b69 28-Nov-2002 Alan Cox <alc@FreeBSD.org>

Lock page field accesses in mincore().

Approved by: re (blanket)


# 3e732e7d 22-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Invoke mac_check_vnode_mmap() during mmap operations on vnodes,
permitting policies to restrict access to memory mapping based on
the credential requesting the mapping, the target vnode, the
requested rights, or other policy considerations.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 05ba50f5 21-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# f6b5b182 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Hold a lock on the vnode acquired from the file table across the call to
vm_mmap() as well as the GETATTR etc.
- If the handle is a vnode in vm_mmap() assert that it is locked.
- Wiggle Giant around a little to account for the extra vnode operation.


# 070f64fe 25-Jun-2002 Matthew Dillon <dillon@FreeBSD.org>

Part I of RLIMIT_VMEM implementation. Implement core functionality for
a new resource limit that covers a process's entire VM space, including
mmap()'d space.

(Part II will be additional code to check RLIMIT_VMEM during exec() but it
needs more fleshing out).

PR: kern/18209
Submitted by: Andrey Alekseyev <uitm@zenon.net>, Dmitry Kim <jason@nichego.net>
MFC after: 7 days


# 2cd301d1 22-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Remove the unnecessary acquisition and release of Giant around fdrop()
in mmap(2).


# c04c996b 22-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Reduce the scope of Giant in vm_mmap() to just the code that manipulates
a vnode. (Thus, MAP_ANON and MAP_STACK never acquire Giant.)


# 042bb299 16-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Remove GIANT_REQUIRED from vm_fault_user_wire().
o Move pmap_pageable() outside of Giant in vm_fault_unwire().
(pmap_pageable() is a no-op on all supported architectures.)
o Remove the acquisition and release of Giant from mlock().


# e30616db 14-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Remove the acquisition and release of Giant from munlock().

Reviewed by: tegge


# 1d7cf06c 14-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Use vm_map_wire() and vm_map_unwire() in place of vm_map_pageable() and
vm_map_user_pageable().
o Remove vm_map_pageable() and vm_map_user_pageable().
o Remove vm_map_clear_recursive() and vm_map_set_recursive(). (They were
only used by vm_map_pageable() and vm_map_user_pageable().)

Reviewed by: tegge


# fa721254 06-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

fix typo in _SYS_SYSPROTO_H_ case: s/mlockall_args/munlockall_args

Submitted by: Mark Santcroos <marks@ripe.net>


# 99b9331a 30-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Check for defined(__i386__) instead of just defined(i386) since the compiler
will be updated to only define(__i386__) for ANSI cleanliness.


# 4b9fdc2b 25-May-2002 Alan Cox <alc@FreeBSD.org>

o Acquire and release Giant around pmap operations in vm_fault_unwire()
and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete()
only if operating on the kernel_object or the kmem_object.
o Remove GIANT_REQUIRED from vm_map_remove().
o Remove the acquisition and release of Giant from munmap().


# e0be79af 18-May-2002 Alan Cox <alc@FreeBSD.org>

o Eliminate the acquisition and release of Giant from minherit(2).
(vm_map_inherit() no longer requires Giant to be held.)


# 094f6d26 18-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and
release Giant around vm_map_madvise()'s call to pmap_object_init_pt().
o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition
and release of Giant.
o Remove the acquisition and release of Giant from madvise().


# 43285049 17-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove the acquisition and release of Giant from mprotect().


# 8c5c5d04 03-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove GIANT_REQUIRED from vm_map_lookup_entry() and
vm_map_check_protection().
o Call vm_map_check_protection() without Giant held in munmap().


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 11caded3 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 1e92845e 15-Feb-2002 Bruce Evans <bde@FreeBSD.org>

Garbage-collect options ACPI_NO_ENABLE_ON_BOOT, AML_DEBUG, BLEED,
DEVICE_SYSCTLS, KEY, LOUTB, NFS_MUIDHASHSIZ, NFS_UIDHASHSIZ, PCI_QUIET
and SIMPLELOCK_DEBUG.


# a4db4953 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# cbc89bfb 10-Oct-2001 Paul Saab <ps@FreeBSD.org>

Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by: peter
MFC after: 2 weeks


# 8c5d4fe8 26-Sep-2001 Robert Watson <rwatson@FreeBSD.org>

o Modify access control checks in mmap() to use securelevel_gt() instead
of direct variable access.

Obtained from: TrustedBSD Project


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# d2c60af8 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Cleanup


# 676274db 24-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Remove support for the badly broken MAP_INHERIT (from -current only).


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 190609dd 24-May-2001 John Baldwin <jhb@FreeBSD.org>

Stick VM syscalls back under Giant if the BLEED option is not defined.


# e4ca250d 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Obtain Giant in mmap() syscall while messing with file descriptors and
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
pager routines that need Giant or into other VM routines that need
Giant.
- Replace code that used a goto to jump around the else branch of a test
to use an else branch instead.


# 12635f9c 22-May-2001 John Baldwin <jhb@FreeBSD.org>

Unlock the VM lock at the end of munlock() instead of locking it again.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 3592b715 26-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


# 2589f249 25-Jun-2000 Mark Murray <markm@FreeBSD.org>

Nifty idea from Jeroen van Gelderen; don't call a routine to check if
we are using the /dev/zero device, just check a flag (supplied by
/dev/zero).
Reviewed by: dfr


# 24964514 21-May-2000 Peter Wemm <peter@FreeBSD.org>

Checkpoint of a new physical memory backed object type, that does not
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.

This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.

Reviewed by: dillon


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# aa543039 22-Apr-2000 Garrett Wollman <wollman@FreeBSD.org>

Implement POSIX.1b shared memory objects. In this implementation,
shared memory objects are regular files; the shm_open(3) routine
uses fcntl(2) to set a flag on the descriptor which tells mmap(2)
to automatically apply MAP_NOSYNC.

Not objected to by: bde, dillon, dufault, jasone


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# 9730a5da 27-Feb-2000 Paul Saab <ps@FreeBSD.org>

Add MAP_NOCORE to mmap(2), and MADV_NOCORE and MADV_CORE to madvise(2).
This feature allows you to specify if mmap'd data is included in
an application's corefile.

Change the type of eflags in struct vm_map_entry from u_char to
vm_eflags_t (an unsigned int).

Reviewed by: dillon,jdp,alfred
Approved by: jkh


# 1f6889a1 16-Feb-2000 Matthew Dillon <dillon@FreeBSD.org>

Fix null-pointer dereference crash when the system is intentionally
run out of KVM through a mmap()/fork() bomb that allocates hundreds
of thousands of vm_map_entry structures.

Add panic to make null-pointer dereference crash a little more verbose.

Add a new sysctl, vm.max_proc_mmap, which specifies the maximum number
of mmap()'d spaces (discrete vm_map_entry's in the process). The value
defaults to around 9000 for a 128MB machine. The test is scaled for the
number of processes sharing a vmspace (aka linux threads). Setting
the value to 0 disables the feature.

PR: kern/16573
Approved by: jkh


# 00d76afe 03-Jan-2000 Guido van Rooij <guido@FreeBSD.org>

Use MAP_NOSYNC for vnodes without any links in their filesystem.

This is necessary for vmware: it does not use an anonymous mmap for
the memory of the virtual system. Instead it creates a temp file and
unlinks it. For a 50 MB file, this results in a lot of syncing
every 30 seconds.

Reviewed by: Matthew Dillon <dillon@backplane.com>


# 4f79d873 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.
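
As a rough illustration of the interface described above, the sketch below maps a temporary file with MAP_SHARED | MAP_NOSYNC and touches one byte. The helper name ex_nosync_roundtrip() is hypothetical, and the fallback definition of MAP_NOSYNC as 0 is an assumption added only so the example also builds on systems that lack the flag; there it degrades to an ordinary shared mapping.

```c
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0	/* assumption: flag unavailable, fall back to plain mmap */
#endif

/*
 * Hypothetical helper: create a one-page temporary file, map it shared
 * with MAP_NOSYNC, write and read back one byte.
 * Returns 0 on success and -1 on any failure.
 */
int
ex_nosync_roundtrip(void)
{
	FILE *fp;
	char *p;
	size_t len;
	long pagesz;
	int fd, ok;

	pagesz = sysconf(_SC_PAGESIZE);
	if (pagesz <= 0)
		return (-1);
	len = (size_t)pagesz;

	fp = tmpfile();
	if (fp == NULL)
		return (-1);
	fd = fileno(fp);

	/* Give the file one page of backing store before mapping it. */
	if (ftruncate(fd, (off_t)len) != 0) {
		fclose(fp);
		return (-1);
	}

	/* Dirty pages of this mapping are not flushed by the periodic sync. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);
	if (p == MAP_FAILED) {
		fclose(fp);
		return (-1);
	}

	p[0] = 'x';
	ok = (p[0] == 'x');

	munmap(p, len);
	fclose(fp);
	return (ok ? 0 : -1);
}
```

Two processes that map the same file this way share the pages; the file contents on disk may lag behind until the pages are paged out or explicitly synced.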

Reviewed by: julian, alc, dg


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# b4309055 20-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

cleanup madvise code, add a few more sanity checks.

Reviewed by: Alan Cox <alc@cs.rice.edu>, dg@root.com


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 4738fa09 05-Jun-1999 Alan Cox <alc@FreeBSD.org>

vm_mmap:
Ensure that device mappings get MAP_PREFAULT(_PARTIAL) set,
so that 4M page mappings are used when possible.

Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>


# e972780a 16-May-1999 Alan Cox <alc@FreeBSD.org>

Add the options MAP_PREFAULT and MAP_PREFAULT_PARTIAL to vm_map_find/insert,
eliminating the need for the pmap_object_init_pt calls in imgact_* and
mmap.

Reviewed by: David Greenman <dg@root.com>


# ea41812f 15-May-1999 Alan Cox <alc@FreeBSD.org>

Remove prototypes for functions that don't exist anymore (vm_map.h).

Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).

Remove a redundant test in vm_uiomove (vm_map.c).

Make two changes to vm_object_coalesce:

1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)

2. Free any swap space allocated to removed pages.


# e5f13bdd 14-May-1999 Alan Cox <alc@FreeBSD.org>

Simplify vm_map_find/insert's interface: remove the MAP_COPY_NEEDED option.

It never makes sense to specify MAP_COPY_NEEDED without also specifying
MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.

Reviewed by: David Greenman <dg@root.com>


# 4d38e6b5 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add brackets to silence egcs and help clarity.


# d28ab90f 05-May-1999 Luoqi Chen <luoqi@FreeBSD.org>

Don't ignore mmap() address hint below the text section.


# f711d546 27-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


# db42d908 19-Apr-1999 Peter Wemm <peter@FreeBSD.org>

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# dd2622a8 02-Mar-1999 Alan Cox <alc@FreeBSD.org>

To avoid a conflict for the vm_map's lock with vm_fault, release
the read lock around the subyte operations in mincore. After the lock is
reacquired, use the map's timestamp to determine if we need to restart
the scan.
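
The restart logic described above can be modelled in a few lines. This is a single-threaded sketch, not the mincore() code: struct ex_map, ex_scan() and ex_fault_once() are hypothetical names, the lock operations are reduced to comments, and the callback stands in for a subyte() call that may fault and modify the map.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vm_map: only what the sketch needs. */
struct ex_map {
	unsigned int timestamp;	/* bumped on every modification */
	int nentries;		/* number of entries to scan */
};

typedef void (*ex_copyout_fn)(struct ex_map *map, int idx, void *arg);

/*
 * Scan every entry, calling copyout() with the (imaginary) read lock
 * dropped.  If the map's timestamp changed while unlocked, the scan
 * starts over.  Returns the number of restarts that were needed.
 */
int
ex_scan(struct ex_map *map, ex_copyout_fn copyout, void *arg)
{
	unsigned int saved;
	int i, restarts;

	assert(map != NULL && copyout != NULL);
	restarts = 0;
restart:
	for (i = 0; i < map->nentries; i++) {
		saved = map->timestamp;
		/* The read lock would be released here. */
		copyout(map, i, arg);
		/* The read lock would be reacquired here. */
		if (map->timestamp != saved) {
			restarts++;
			goto restart;
		}
	}
	return (restarts);
}

/* Example callback: modifies the map once, while handling entry 1. */
void
ex_fault_once(struct ex_map *map, int idx, void *arg)
{
	int *fired = arg;

	if (idx == 1 && !*fired) {
		*fired = 1;
		map->timestamp++;
	}
}
```

With a three-entry map and ex_fault_once() as the callback, the scan restarts exactly once and then runs to completion.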


# eff50fcd 01-Mar-1999 Alan Cox <alc@FreeBSD.org>

mincore doesn't modify the vm_map. Therefore, it doesn't require
an exclusive lock. A read lock will suffice.


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillon <dillon@apollo.backplane.com>


# 9fdfe602 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 2907af2a 25-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Mostly remove the VM_STACK OPTION.
This changes the definitions of a few items so that structures are the
same whether or not the option itself is enabled. This allows
people to enable and disable the option without recompiling the world.

As the author says:

|I ran into a problem pulling out the VM_STACK option. I was aware of this
|when I first did the work, but then forgot about it. The VM_STACK stuff
|has some code changes in the i386 branch. There need to be corresponding
|changes in the alpha branch before it can come out completely.

what is done:
|
|1) Pull the VM_STACK option out of the header files it appears in. This
|really shouldn't affect anything that executes with or without the rest
|of the VM_STACK patches. The vm_map_entry will then always have one
|extra element (avail_ssize). It just won't be used if the VM_STACK
|option is not turned on.
|
|I've also pulled the option out of vm_map.c. This shouldn't harm anything,
|since the routines that are enabled as a result are not called unless
|the VM_STACK option is enabled elsewhere.
|
|2) Add what appears to be appropriate code to the alpha branch, still
|protected behind the VM_STACK switch. I don't have an alpha machine,
|so we would need to get some testers with alpha machines to try it out.
|
|Once there is some testing, we can consider making the change permanent
|for both i386 and alpha.
|
[..]
|
|Once the alpha code is adequately tested, we can pull VM_STACK out
|everywhere.
|

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 2267af78 06-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# fc565456 09-Dec-1998 Dmitrij Tejblum <dt@FreeBSD.org>

Don't disable mmap with large file offset.


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.
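
The second class of bug can be reproduced in miniature. The functions below are illustrative only, with an assumed 4 KB page size and hypothetical names; they are not the historical macros. They show how a narrowing cast inside a page-truncation helper discards the upper half of a 64-bit offset.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_MASK 4095u	/* assumed 4 KB pages */

/* Buggy form: the cast to a 32-bit type drops the high 32 bits. */
uint64_t
ex_trunc_page_bad(uint64_t x)
{
	return ((uint32_t)x & ~(uint32_t)EX_PAGE_MASK);
}

/* Fixed form: the arithmetic stays in the operand's full width. */
uint64_t
ex_trunc_page_ok(uint64_t x)
{
	return (x & ~(uint64_t)EX_PAGE_MASK);
}
```

For an offset just above 4 GB, such as 0x100001234, the first helper yields 0x1000 while the second yields 0x100001000.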


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# a23d65bf 14-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively. Most of the longs should probably have been
u_longs, but this change is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.


# 711458e3 05-Jul-1998 Doug Rabson <dfr@FreeBSD.org>

Don't truncate the return value of mmap to sizeof(int).


# be160d60 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 4183b6b6 19-May-1998 Peter Wemm <peter@FreeBSD.org>

Make the previous commit compile..


# 05feb99f 18-May-1998 Guido van Rooij <guido@FreeBSD.org>

Plug hole reported on Bugtraq: do not allow mmap with WRITE privs for
append-only and immutable files.

Obtained from: OpenBSD (partly)


# c8bdd56b 12-Mar-1998 Guido van Rooij <guido@FreeBSD.org>

Fix for mmap of char devices bug as described in OpenBSD advisory of
1998/02/20
Reviewed by: John Dyson
Submitted by: "Cy Schubert" <cschuber@uumail.gov.bc.ca>


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 651bb817 30-Dec-1997 Alexander Langer <alex@FreeBSD.org>

caddr_t --> void *


# 5591b823d 16-Dec-1997 Eivind Eklund <eivind@FreeBSD.org>

Make COMPAT_43 and COMPAT_SUNOS new-style options.


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warnings, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
to be recompiled.


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 54f42e4b 30-Aug-1997 Peter Wemm <peter@FreeBSD.org>

Allow non-page aligned file offset mmap's, providing that the system is
allowed to choose the address, or that the MAP_FIXED address has the same
remainder when modulo PAGE_SIZE as the file offset. Apparently this is
posix1003.1b specified behavior. SVR4 and the other *BSD's allow it too.
It costs us nothing to support and means we don't get EINVAL on some mmap
code that works perfectly elsewhere.
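
The MAP_FIXED condition stated above amounts to a one-line check. The helper below is a hypothetical sketch of that rule, not the kernel's implementation; its name and the choice to pass the page size as a parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a MAP_FIXED address and a file
 * offset have the same remainder modulo the page size, which is the
 * case in which an unaligned offset can be honoured.
 */
bool
ex_fixed_addr_matches_offset(uintptr_t addr, uint64_t offset,
    uint64_t page_size)
{
	/* Page sizes are powers of two, so a mask computes the remainder. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return ((addr & (page_size - 1)) == (offset & (page_size - 1)));
}
```

With 4 KB pages, an address of 0x10000123 is compatible with a file offset of 0x123, while a page-aligned address of 0x10000000 is not.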

Obtained from: NetBSD


# b9dcd593 25-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Fixed type mismatches for functions with args of type vm_prot_t and/or
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.


# 0a0a85b3 16-Jul-1997 John Dyson <dyson@FreeBSD.org>

Add support for 4MB pages. This includes the .text, .data, .data parts
of the kernel, and also most of the dynamic parts of the kernel. Additionally,
4MB pages will be allocated for display buffers as appropriate (only.)

The 4MB support for SMP isn't complete, but doesn't interfere with operation
either.


# 4a40e3d4 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Correct the return code for the mlock system call. Also add the stubs
for mlockall and munlockall.


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# afa07f7e 15-Jan-1997 John Dyson <dyson@FreeBSD.org>

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 9b5a5d81 11-Jan-1997 John Dyson <dyson@FreeBSD.org>

Prepare better for multi-platform by eliminating another required
pmap routine (pmap_is_referenced.) Upper level recoded to use
pmap_ts_referenced.


# d0aea04f 29-Dec-1996 John Dyson <dyson@FreeBSD.org>

Let the VM system know that on certain arch's that VM_PROT_READ
also implies VM_PROT_EXEC. We support it that way for now,
since the break system call by default gives VM_PROT_ALL. Now
we have a better chance of coalescing map entries when mixing
mmap/break type operations. This was contributing to excessive
numbers of map entries on the modula-3 runtime system. The
problem is still not "solved", but the situation makes more
sense.

Eventually, when we work on architectures where VM_PROT_READ
is orthogonal to VM_PROT_EXEC, we will have to visit this
issue carefully (esp. regarding security issues.)


# 94328e90 28-Dec-1996 John Dyson <dyson@FreeBSD.org>

The code unnecessarily created an object with no handle up-front, which
has the negative effect of disabling some map optimizations. This
patch defers the creation of the object until it needs to be at fault time.
Submitted by: Alan Cox <alc@cs.rice.edu>


# e9822d92 22-Dec-1996 Joerg Wunsch <joerg@FreeBSD.org>

Make DFLDSIZ and MAXDSIZ fully-supported options.

"Don't forget to do a ``make depend''" :-)


# 7aaaa4fd 14-Dec-1996 John Dyson <dyson@FreeBSD.org>

Implement closer-to POSIX mlock semantics. The major difference is
that we do allow mlock to span unallocated regions (of course, not
mlocking them.) We also allow mlocking of RO regions (which the old
code couldn't.) The restriction there is that once a RO region is
wired (mlocked), it cannot be debugged (or EVER written to.)

Under normal usage, the new mlock code will be a significant improvement
over our old stuff.


# 851c12ff 29-Oct-1996 John Dyson <dyson@FreeBSD.org>

Change mmap to use OBJT_DEFAULT instead of OBJT_SWAP by default
for anonymous objects. The system will automatically change the
type to SWAP if needed (for size or pageout reasons.)


# fcae040b 23-Oct-1996 John Dyson <dyson@FreeBSD.org>

Remove a bogus optimization in the mmap code. It is superfluous,
and at best is the same speed as the unoptimized code. At worst, it
slows down trivial programs.


# 1111860c 13-Oct-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a stale comment.


# cd6eea25 19-Sep-1996 David Greenman <dg@FreeBSD.org>

Fixed bug with reversed trunc/round_page() in madvise...start must be
trunced, end must be rounded.
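
A minimal sketch of the corrected range computation, assuming 4 KB pages. The names ex_trunc_page(), ex_round_page() and ex_advice_range() are hypothetical and only mirror the rule stated above: the start is rounded down and the end is rounded up, so the resulting range covers every page the caller touched.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096u	/* assumed page size for the example */
#define EX_PAGE_MASK ((uint64_t)EX_PAGE_SIZE - 1)

/* Round an address down to the start of its page. */
uint64_t
ex_trunc_page(uint64_t x)
{
	return (x & ~EX_PAGE_MASK);
}

/* Round an address up to the next page boundary. */
uint64_t
ex_round_page(uint64_t x)
{
	return ((x + EX_PAGE_MASK) & ~EX_PAGE_MASK);
}

/* Compute the page-aligned range [*start, *end) covering addr..addr+len. */
void
ex_advice_range(uint64_t addr, uint64_t len, uint64_t *start, uint64_t *end)
{
	assert(start != 0 && end != 0);
	*start = ex_trunc_page(addr);
	*end = ex_round_page(addr + len);
}
```

Swapping the two helpers, as in the bug being fixed, would shrink the range so that partially covered pages at either end are skipped.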


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 0f281c28 27-Jul-1996 David Greenman <dg@FreeBSD.org>

Slight performance tweak for previous commit.


# bf6dfc7b 27-Jul-1996 John Dyson <dyson@FreeBSD.org>

Allow sequentially created mmap'ed anonymous regions to coalesce. There
is little or no reason to create a swap pager for small mmap's. The
vm_map_insert code will automatically create a swap pager if the object
becomes too large. This fix, per a request from phk.


# feb32a8f 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

Remove experimental header file. My test-build must have picked it
up in an unexpected place.
Submitted by: jkh


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# f35329ac 30-May-1996 John Dyson <dyson@FreeBSD.org>

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# 867a482d 19-May-1996 John Dyson <dyson@FreeBSD.org>

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contains
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existent condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# aa8de40a 03-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


# 8f2ec877 16-Mar-1996 David Greenman <dg@FreeBSD.org>

Force device mappings to always be shared. It doesn't make sense for them
to ever be COW and we need the mappings to be shared for backward
compatibility.

Reviewed by: dyson


# 5850152d 11-Mar-1996 John Dyson <dyson@FreeBSD.org>

Allow mmap'ed devices to work correctly across forks. The sanest
solution appeared to be to allow the child to maintain the same mapping as
the parent.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 9154ee6a 02-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Oops.. I nearly forgot the actual core of the length/rounding/etc fixes
that Bruce asked for.

These still are not quite perfect, and in particular, it can get
upset on extreme boundary cases (addr = 0xfff, len = 0xffffffff,
which would end up mapping a single page rather than failing), but
this is better code than I committed before.

(note, the VM system does not (apparently) support single mmap segment
sizes above 0x80000000 anyway)
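
The boundary case mentioned above can be reproduced with plain 32-bit arithmetic. This is a hedged sketch that assumes the length is first grown by the address's offset within its page and then rounded up to a page multiple; the `ex_` names and the 4096-byte page size are illustrative, not taken from the commit.

```c
#include <stdint.h>

#define EX_PAGE_SIZE 4096u                  /* assumed page size */
#define EX_PAGE_MASK (EX_PAGE_SIZE - 1)

/*
 * Number of pages a request would cover when the adjustment is done
 * in 32-bit arithmetic (hypothetical helper, for illustration only).
 */
static uint32_t ex_pages_mapped(uint32_t addr, uint32_t len)
{
    uint32_t pageoff = addr & EX_PAGE_MASK; /* offset of addr within its page */
    uint32_t size = len + pageoff;          /* may wrap around modulo 2^32 */

    size = (size + EX_PAGE_MASK) & ~EX_PAGE_MASK;   /* round up to pages */
    return size / EX_PAGE_SIZE;
}
```

With addr = 0xfff and len = 0xffffffff, the sum 0xffffffff + 0xfff wraps to 0xffe, which rounds up to 0x1000, so the request collapses to a single page instead of being rejected.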


# de5f6a77 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# dabee6fe 23-Feb-1996 Peter Wemm <peter@FreeBSD.org>

kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in its own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# f2c6b65b 17-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Fixed 1TB filesize changes. Some pindexes had bogus names and types
but worked because vm_pindex_t is indistinguishable from vm_offset_t.


# 3048c512 12-Dec-1995 John Dyson <dyson@FreeBSD.org>

There was a bug that the size for an msync'ed region was not rounded
up. The effect of this was that msync with a size would generally sync
1 page less than it should. This problem was brought to my attention
by Darrel Herbst <dherbst@gradin.cis.upenn.edu> and Ron Minnich
<rminnich@sarnoff.com>.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
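
A rough sketch of why indexing by page helps: a page index needs fewer bits than a byte offset, so a 32-bit index can name pages far beyond the 4 GB a 32-bit byte offset can reach. The type and helper names below are hypothetical, and a 4096-byte page (shift of 12) is assumed.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12                    /* assumed: 4096-byte pages */

typedef uint32_t ex_pindex_t;               /* hypothetical 32-bit page index */

/* Convert a 64-bit byte offset within an object to a page index. */
static ex_pindex_t ex_off_to_idx(uint64_t off)
{
    return (ex_pindex_t)(off >> EX_PAGE_SHIFT);
}

/* Convert a page index back to the byte offset of the page's start. */
static uint64_t ex_idx_to_off(ex_pindex_t idx)
{
    return (uint64_t)idx << EX_PAGE_SHIFT;
}
```

Under these assumptions a byte offset of 1 TB (2^40) maps to page index 2^28, which fits comfortably in 32 bits.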


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# cac597e4 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# d2d3e875 11-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# e17bed12 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

First phase of removing the PG_COPYONWRITE flag, and an architectural
cleanup of mapping files.


# 02c04a2f 21-Oct-1995 John Dyson <dyson@FreeBSD.org>

Implement mincore system call.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 06cb7259 09-Jul-1995 David Greenman <dg@FreeBSD.org>

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 5f55e841 17-May-1995 David Greenman <dg@FreeBSD.org>

Accessing pages beyond the end of a mapped file results in internal
inconsistencies in the VM system that eventually lead to a panic. These
changes fix the behavior to conform to the behavior in SunOS, which is
to deny faults to pages beyond the EOF (returning SIGBUS). Internally,
this is implemented by requiring faults to be within the object size
boundaries. These changes exposed another bug, namely that passing in
an offset to mmap when trying to map an unnamed anonymous region also
results in internal inconsistencies. In this case, the offset is forced
to zero.

Reviewed by: John Dyson and others


# c3cb3e12 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# 6c534ad8 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Fix logic bug I just introduced with the flags to msync().


# 1e62bc63 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Disallow both MS_ASYNC and MS_INVALIDATE flags being set at the same time
in msync().
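
The check described above might look roughly like the following sketch. The flag values and the `ex_` names are placeholders rather than the real <sys/mman.h> definitions, and returning EINVAL is an assumption, since the commit message only says the combination is disallowed.

```c
#include <errno.h>

/* Placeholder flag values; the real ones live in <sys/mman.h>. */
#define EX_MS_ASYNC      0x1
#define EX_MS_INVALIDATE 0x2

/* Reject requests that set both flags at once; accept everything else. */
static int ex_check_msync_flags(int flags)
{
    const int both = EX_MS_ASYNC | EX_MS_INVALIDATE;

    if ((flags & both) == both)
        return EINVAL;                      /* assumed error code */
    return 0;
}
```

Either flag on its own passes the check; only the combination is refused.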


# e6c6af11 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Added "flags" argument to msync, and implemented MS_ASYNC and MS_INVALIDATE.
The MS_ASYNC flag doesn't currently work, and MS_INVALIDATE will only toss out
the pages in the address space (not all pages in the shadow chain).


# 8f4e17d4 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed bug in vm_mmap() where the object that is created in some cases
was the wrong size. This is the likely cause of panics reported by
Lars Fredriksen and Paul Richards related to a -1 blkno when paging
via the swap_pager.

Submitted by: John Dyson


# bc9ad247 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Disallow non page-aligned file offsets in vm_mmap(). We don't support this
in either the high or low level parts of the VM system. Just return EINVAL
in this case, just like SunOS does.
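
A minimal sketch of such an alignment check, assuming a 4096-byte page; the helper name is hypothetical and the surrounding vm_mmap() logic is omitted.

```c
#include <errno.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096u                  /* assumed page size */
#define EX_PAGE_MASK (EX_PAGE_SIZE - 1)

/* Return EINVAL unless the file offset is a multiple of the page size. */
static int ex_check_file_offset(uint64_t foff)
{
    if ((foff & EX_PAGE_MASK) != 0)
        return EINVAL;
    return 0;
}
```

An offset of 0x2000 passes, while 0x2001 is rejected.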


# fbcfcdf7 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed bug in the size == 0 case of msync() caused by a bogus return value
check.


# edf8a815 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed redundant newlines that were in some panic strings.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# c4ed5a07 12-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed obsolete comment.


# f2da180f 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed object reference count problem that occurred in the MAP_PRIVATE
case after we rewrote vm_mmap(). Added some comments to make it easier
to follow the reference counts.


# 50ce2102 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Rewrote MAP_PRIVATE case of vm_mmap() - all of the COW portion of this
routine was highly convoluted.

Submitted by: John Dyson


# 7fb0c17e 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Deprecated remaining use of vm_deallocate. Deprecated
vm_allocate_with_pager(). Almost completely rewrote vm_mmap(); when John gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# ca40da74 15-Feb-1995 David Greenman <dg@FreeBSD.org>

Don't bother calling pmap_create() when creating the temporary map.
The whole COW section of vm_mmap() should be rewritten; the current
implementation is highly convoluted.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 90324b07 02-Sep-1994 David Greenman <dg@FreeBSD.org>

Whoops, accidentally left out some pieces of the munmapfd patch.


# f720dc2c 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Enabled page table preloading of cached objects.

Submitted by: John Dyson


# bbc0ec52 03-Aug-1994 David Greenman <dg@FreeBSD.org>

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources