History log of /freebsd-9.3-release/sys/vm/vm_mmap.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 266498 21-May-2014 pho

MFC r265534:

msync(2) must return ENOMEM and not EINVAL when the address is outside the
allowed range or when one or more pages are not mapped. This according to
The Open Group Base Specifications Issue 7.

Sponsored by: EMC / Isilon storage division


# 260208 02-Jan-2014 jhb

MFC 255708,255711,255731:
Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.


# 258870 03-Dec-2013 jhb

MFC 253471,253620,254430,254538:
Change mmap() to more optimally use superpages and provide support for
tweaking alignment of virtual mappings.
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
vm_map_find() that will try to alter the alignment of a mapping to match
any existing superpage mappings of the object being mapped. If no
suitable address range is found with the necessary alignment,
vm_map_find() will fall back to using the simple first-fit strategy
(VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), shm_map(), and the GEM mapping
ioctl to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests for n >= number of bits in a pointer or less than the size of
a page fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.

PR: ports/184173 (exp-run)


# 258144 14-Nov-2013 jhb

MFC 255497:
Fix an off-by-one error when populating mincore(2) entries for
skipped entries. lastvecindex references the last valid byte,
so the new bytes should come after it.


# 253801 30-Jul-2013 jlh

MFC r253554:
Fix a panic in the racct code when munlock(2) is called with incorrect values.

The racct code in sys_munlock() assumed that the boundaries provided by
the userland were correct as long as vm_map_unwire() returned
successfully. However the latter contains its own logic and sometimes
manages to do something out of those boundaries, even if they are buggy.
This change makes the racct code to use the accounting done by the vm
layer, as it is done in other places such as vm_mlock().

Despite fixing the panic, Alan Cox pointed that this code is still
race-y though: two simultaneous callers will produce incorrect values.

Reviewed by: alc

MFC r253556:
Fix previous commit when option RACCT is not used.

Approved by: re (kib)


# 249079 04-Apr-2013 kib

MFC r248815:
Release the v_writecount reference on the vnode in case of error,
before the vnode is vput() in vm_mmap_vnode().


# 245787 22-Jan-2013 zont

MFC r240145:
- Simplify VM code by using vmspace_wired_count() for counting wired
memory of a process.

MFC r245255:
- Reduce kernel size by removing unnecessary pointer indirections.

GENERIC kernel size reduced in 16 bytes and RACCT kernel in 336 bytes.

MFC r245296:
- Improve readability of sys_obreak().

MFC r245421:
- Get rid of unused function vmspace_wired_count().


# 245420 14-Jan-2013 zont

- Fix r245416.
Turn unprivileged mlock off for compatibility. Exactly this behaviour
was approved by kib (mentor).

This is a direct commit.

Approved by: kib (mentor)


# 245416 14-Jan-2013 zont

MFC r244384:
- Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

MFC r244385:
- Add sysctl to allow unprivileged users to call mlock(2)-family system
calls and turn it off for compatibility.
- Do not allow to call them inside jail.

Approved by: kib (mentor)


# 239577 22-Aug-2012 kib

MFC r239250:
For old mmap syscall, when executing on amd64 or ia64, enforce the
PROT_EXEC if prot is non-zero, process is 32bit and
kern.elf32.i386_read_exec syscal is enabled.


# 239573 22-Aug-2012 kib

MFC r239247:
Adjust the r205536, by allowing a non-zero offset for anonymous
mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.


# 234767 28-Apr-2012 kib

MFC r234556:
When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.

Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.


# 234766 28-Apr-2012 alc

MFC r234039
Fix mincore(2) so that it reports PG_CACHED pages as resident.


# 234763 28-Apr-2012 alc

MFC r232166
Simplify vm_mmap()'s control flow.

Add a comment describing what vm_mmap_to_errno() does.


# 233728 31-Mar-2012 kib

MFC r233100:
In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927


# 233001 15-Mar-2012 kib

MFC r232071:
Account the writeable shared mappings backed by file in the vnode
v_writecount.

MFC r232103:
Place the if() at the right location.

MFC note: the added struct vm_object un_pager.vnp.writemappings member
is located after the fields of struct vm_object that could be accessed
from the modules.


# 231889 17-Feb-2012 kib

MFC r231526:
Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 225418 06-Sep-2011 kib

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# 224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 223914 10-Jul-2011 kib

Extract the code to translate VM error into errno, into an exported
function vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).

Sponsored by: The FreeBSD Foundation
No objections from: alc
MFC after: 1 week


# 223913 10-Jul-2011 kib

Style.

MFC after: 3 days


# 223825 06-Jul-2011 trasz

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 220373 05-Apr-2011 trasz

Add accounting for most of the memory-related resources.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 218989 24-Feb-2011 pluknet

Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.
As it was pointed out by Alan Cox, that no longer serves its purpose with
the modern UMA allocator compared to the old one used in 4.x days.

The removal of sysctl eliminates max_proc_mmap type overflow leading to
the broken mmap(2) seen with large amount of physical memory on arches
with factually unbound KVA space (such as amd64). It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.

Reviewed by: alc, kib
Approved by: avg (mentor)
MFC after: 2 months


# 216186 04-Dec-2010 trasz

Fix comment intentation.


# 215321 14-Nov-2010 kib

Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by: swell.k gmail com
MFC after: 1 week


# 215309 14-Nov-2010 kib

Use symbolic names instead of hardcoding values for magic p_osrel constants.

MFC after: 1 week


# 212873 19-Sep-2010 alc

Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8. This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR: 150260
MFC after: 3 weeks


# 212282 07-Sep-2010 rstone

Fix a typo in r212281. uintptr -> uintptr_t

Pointy hat to: rstone

Approved by: emaste (mentor)
MFC after: 2 weeks


# 212281 06-Sep-2010 rstone

In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock. This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by: fabient
Approved by: emaste (mentor)
MFC after: 2 weeks


# 211937 28-Aug-2010 alc

Add the MAP_PREFAULT_READ option to mmap(2).

Reviewed by: jhb, kib


# 210923 06-Aug-2010 kib

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


# 210548 27-Jul-2010 trasz

Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.


# 208574 26-May-2010 alc

Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by: kib


# 208504 24-May-2010 alc

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 207410 29-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 207155 24-Apr-2010 alc

Resurrect pmap_is_referenced() and use it in mincore(). Essentially,
pmap_ts_referenced() is not always appropriate for checking whether or
not pages have been referenced because it clears any reference bits
that it encounters. For example, in mincore(), clearing the reference
bits has two negative consequences. First, it throws off the activity
count calculations performed by the page daemon. Specifically, a page
on which mincore() has called pmap_ts_referenced() looks less active
to the page daemon than it should. Consequently, the page could be
deactivated prematurely by the page daemon. Arguably, this problem
could be fixed by having mincore() duplicate the activity count
calculation on the page. However, there is a second problem for which
that is not a solution. In order to clear a reference on a 4KB page,
it may be necessary to demote a 2/4MB page mapping. Thus, a mincore()
by one process can have the side effect of demoting a superpage
mapping within another process!


# 205536 23-Mar-2010 jhb

Reject attempts to create a MAP_ANON mapping with a non-zero offset.

PR: kern/71258
Submitted by: Alexander Best
MFC after: 2 weeks


# 197712 02-Oct-2009 bz

Back out the functional parts from r197537. After r197711, affecting all
user mappings, mmap no longer needs special treatment.


# 197537 27-Sep-2009 simon

Do not allow mmap with the MAP_FIXED argument to map at address zero.
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities. While this of course does not fix vulnerabilities,
it does mitigate their impact.

Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.

This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.

Discussed with: rwatson, bz
Tested by: bz (Wine), simon (VirtualBox)
Submitted by: jhb


# 197348 20-Sep-2009 kib

Old (a.out) rtld attempts to mmap zero-length region, e.g. when bss
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.

For a.out and pre-8 ELF binaries, allow the mmap of zero length.

Reported by: tegge
Reviewed by: tegge, alc, jhb
MFC after: 3 days


# 195693 14-Jul-2009 jhb

- Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.

Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week


# 194766 23-Jun-2009 kib

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 193275 01-Jun-2009 jhb

Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping V_CHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.

MFC after: 1 month


# 190705 04-Apr-2009 alc

Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission. On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission. The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list. So,
the need for coalescing vm map entries is not as great as it once was.


# 189015 24-Feb-2009 kib

Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by: pho
Reviewed by: alc


# 188334 08-Feb-2009 kib

Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by: tegge, alc
Tested by: pho


# 187527 21-Jan-2009 jhb

Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by: ups


# 184168 22-Oct-2008 rwatson

Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after: 3 days


# 183216 20-Sep-2008 kib

Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with: rwatson, jhb, rnoland
Tested by: rnoland
MFC after: 3 days


# 182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 181239 03-Aug-2008 trhodes

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


# 179296 24-May-2008 alc

To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped. Specifically, it has
returned EINVAL if the entire range is not mapped. There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement. This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks


# 179076 17-May-2008 alc

In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping. Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address. Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address. If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated. In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().


# 178630 28-Apr-2008 alc

vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.


# 177458 20-Mar-2008 kib

Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.

PR: kern/119422
MFC after: 1 week


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 175164 08-Jan-2008 jhb

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172779 18-Oct-2007 peter

Fix cosmetic bug in stale copy of msync_args. 'len' is size_t, not int.


# 171902 20-Aug-2007 kib

Do not drop vm_map lock between doing vm_map_remove() and vm_map_insert().
For this, introduce vm_map_fixed() that does that for MAP_FIXED case.

Dropping the lock allowed for parallel thread to occupy the freed space.

Reported by: Tijl Coosemans <tijl ulyssis org>
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 171212 04-Jul-2007 peter

Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate

Approved by: re (kensmith)


# 170864 17-Jun-2007 mjacob

Make sure object is NULL- there is a possible case where you could
fall through to it being used w/o being set. Put a break in the default
case.


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 159837 21-Jun-2006 kib

Make the mincore(2) return ENOMEM when requested range is not fully mapped.

Requested by: Bruno Haible <bruno at clisp org>
Reviewed by: alc
Approved by: pjd (mentor)
MFC after: 1 month


# 157920 21-Apr-2006 trhodes

It seems that POSIX would rather ENODEV returned in place of EINVAL when
trying to mmap() an fd that isn't a normal file.

Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Submitted by: fanf


# 157144 26-Mar-2006 jkoshy

MFP4: Support for profiling dynamically loaded objects.

Kernel changes:

Inform hwpmc of executable objects brought into the system by
kldload() and mmap(), and of their removal by kldunload() and
munmap(). A helper function linker_hwpmc_list_objects() has been
added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
the list of currently loaded kernel modules.

The unused `MAPPINGCHANGE' event has been deprecated in favour
of separate `MAP_IN' and `MAP_OUT' events; this change reduces
space wastage in the log.

Bump the hwpmc's ABI version to "2.0.00". Teach hwpmc(4) to
handle the map change callbacks.

Change the default per-cpu sample buffer size to hold
32 samples (up from 16).

Increment __FreeBSD_version.

libpmc(3) changes:

Update libpmc(3) to deal with the new events in the log file; bring
the pmclog(3) manual page in sync with the code.

pmcstat(8) changes:

Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
(mapfile name), "-q"/"-v" (verbosity control). Option "-k" now
takes a kernel directory as its argument but will also work with
the older invocation syntax.

Rework string handling in pmcstat(8) to use an opaque type for
interned strings. Clean up ELF parsing code and add support for
tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).

Report statistics at the end of a log conversion run depending
on the requested verbosity level.

Reviewed by: jhb, dds (kernel parts of an earlier patch)
Tested by: gallatin (earlier patch)


# 151252 12-Oct-2005 dds

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 150926 04-Oct-2005 dds

Update the vnode's access time after an mmap operation on it.
Before this change a copy operation with cp(1) would not update the
file access times.

According to the POSIX mmap(2) documentation: the st_atime field
of the mapped file may be marked for update at any time between the
mmap() call and the corresponding munmap() call. The initial read
or write reference to a mapped region shall cause the file's st_atime
field to be marked for update if it has not already been marked for
update.


# 150397 20-Sep-2005 peter

Remove unused (but initialized) variable 'objsize' from vm_mmap()


# 145076 14-Apr-2005 csjp

Move MAC check_vnode_mmap entry point out from being exclusive to
MAP_SHARED so that the entry point gets executed un-conditionally.
This may be useful for security policies which want to perform access
control checks around run-time linking.

-add the mmap(2) flags argument to the check_vnode_mmap entry point
so that we can make access control decisions based on the type of
mapped object.
-update any dependent API around this parameter addition such as
function prototype modifications, entry point parameter additions
and the inclusion of sys/mman.h header file.
-Change the MLS, BIBA and LOMAC security policies so that subject
domination routines are not executed unless the type of mapping is
shared. This is done to maintain compatibility between the old
vm_mmap_vnode(9) and these policies.

Reviewed by: rwatson
MFC after: 1 month


# 144501 01-Apr-2005 jhb

- Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmaping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.

Reviewed by: alc, arch@


# 140782 24-Jan-2005 phk

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# 140723 24-Jan-2005 jeff

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 136961 26-Oct-2004 phk

Don't clear flags we just checked were not set.


# 135727 24-Sep-2004 phk

XXX mark two places where we do not hold a threadcount on the dev when
frobbing the cdevsw.

In both cases we examine only the cdevsw and it is a good question if we
weren't better off copying those properties into the cdev in the first
place. This question will be revisited.


# 134615 01-Sep-2004 alc

Remove dead code.


# 133158 05-Aug-2004 phk

Remove a product specific workaround for wrong modes when mmap(2)'ing
devices. They have had plenty of time to adjust now.


# 132999 02-Aug-2004 alc

Eliminate the acquisition and release of Giant around the call to
pmap_mincore() in mincore(2). Either pmap locking exists (alpha, amd64,
i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).


# 130344 11-Jun-2004 phk

Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.


# 129110 11-May-2004 tjr

To handle orphaned character device vnodes properly in mmap(), check that
v_mount is non-null before dereferencing it. If it's null, behave as if
MNT_NOEXEC was not set on the mount that originally containined it.


# 127961 06-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127879 05-Apr-2004 kan

Delay permission checks for VCHR vnodes until after vnode is locked in
vm_mmap_vnode function, where we can safely check for a special /dev/zero
case. Rev. 1.180 has reordered checks and introduced a regression.

Submitted by: alc
Was broken by: kan


# 127187 18-Mar-2004 guido

When mmap-ing a file from a noexec mount, be sure not to grant the right
to mmap it PROT_EXEC. This also depends on the architecture, as some
architextures (e.g. i386) do not distinguish between read and exec pages

Inspired by: http://linux.bkbits.net:8080/linux-2.4/cset@1.1267.1.85
Reviewed by: alc


# 127013 15-Mar-2004 truckman

Make overflow/wraparound checking more robust and unbreak len=0 in
vslock(), mlock(), and munlock().

Reviewed by: bde


# 127008 15-Mar-2004 truckman

Style(9) changes.

Pointed out by: bde


# 127006 15-Mar-2004 truckman

Remove redundant suser() check.


# 126668 05-Mar-2004 truckman

Undo the merger of mlock()/vslock and munlock()/vsunlock() and the
introduction of kern_mlock() and kern_munlock() in
src/sys/kern/kern_sysctl.c 1.150
src/sys/vm/vm_extern.h 1.69
src/sys/vm/vm_glue.c 1.190
src/sys/vm/vm_mmap.c 1.179
because different resource limits are appropriate for transient and
"permanent" page wiring requests.

Retain the kern_mlock() and kern_munlock() API in the revived
vslock() and vsunlock() functions.

Combine the best parts of each of the original sets of implementations
with further code cleanup. Make the mclock() and vslock()
implementations as similar as possible.

Retain the RLIMIT_MEMLOCK check in mlock(). Move the most strigent
test, which can return EAGAIN, last so that requests that have no
hope of ever being satisfied will not be retried unnecessarily.

Disable the test that can return EAGAIN in the vslock() implementation
because it will cause the sysctl code to wedge.

Tested by: Cy Schubert <Cy.Schubert AT komquats.com>


# 126424 01-Mar-2004 kan

Pich up a do {} while(0) cleanup by phk that was discarded accidentally in
previous revision.

Submitted by: alc


# 126332 27-Feb-2004 kan

Move the code dealing with vnode out of several functions into a single
helper function vm_mmap_vnode.

Discussed with: jeffr,alc (a while ago)


# 126253 25-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 123697 21-Dec-2003 alc

- Correct an error in mincore(2) that has existed since its introduction:
mincore(2) should check that the page is valid, not just allocated.
Otherwise, it can return a false positive for a page that is not yet
resident because it is being read from disk.


# 123280 08-Dec-2003 kan

Remove trailing whitespace.


# 123276 07-Dec-2003 alc

Addendum to revision 1.174: In the case where vm_pager_allocate() is called
to create a vnode-backed object, the vnode lock must be held by the caller.

Reported by: truckman
Discussed with: kan


# 123168 06-Dec-2003 alc

Fix a deadlock between vm_fault() and vm_mmap(): The expected lock ordering
between vm_map and vnode locks is that vm_map locks are acquired first. In
revision 1.150 mmap(2) was changed to pass a locked vnode into vm_mmap().
This creates a lock-order reversal when vm_mmap() calls one of the vm_map
routines that acquires a vm_map lock. The solution implemented herein is
to release the vnode lock in mmap() before calling vm_mmap() and reacquire
this lock if necessary in vm_mmap().

Approved by: re (scottl)
Reviewed by: jeff, kan, rwatson


# 122651 14-Nov-2003 alc

- Remove long dead code.


# 122646 14-Nov-2003 alc

Changes to msync(2)
- Return EBUSY if the region was wired by mlock(2) and MS_INVALIDATE
is specified to msync(2). This is required by the Open Group Base
Specifications Issue 6.
- vm_map_sync() doesn't return KERN_FAILURE. Thus, msync(2) can't
possibly return EIO.
- The second major loop in vm_map_sync() handles sub maps. Thus,
failing on sub maps in the first major loop isn't necessary.


# 122384 09-Nov-2003 alc

- The Open Group Base Specifications Issue 6 specifies that an munmap(2)
must return EINVAL if size is zero. Submitted by: tegge
- In order to avoid a race condition in multithreaded applications, the
check and removal operations by munmap(2) must be in the same critical
section. To accomodate this, vm_map_check_protection() is modified to
require its caller to obtain at least a read lock on the map.


# 122367 09-Nov-2003 alc

- Remove Giant from msync(2). Giant is still acquired by the lower layers
if we drop into the pmap or vnode layers.
- Migrate the handling of zero-length msync(2)s into vm_map_sync() so that
multithread applications can't change the map between implementing the
zero-length hack in msync(2) and reacquiring the map lock in
vm_map_sync().

Reviewed by: tegge


# 122349 09-Nov-2003 alc

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


# 120837 05-Oct-2003 bms

Only the super-user should be able to wire pages via the mlock() family
of system calls at this time. Remove various #ifdef's to enforce this.


# 120531 27-Sep-2003 marcel

Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64


# 120422 24-Sep-2003 peter

Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.


# 119858 07-Sep-2003 alc

Revise the locking in mincore(2).


# 118771 11-Aug-2003 bms

Add the mlockall() and munlockall() system calls.
- All those diffs to syscalls.master for each architecture *are*
necessary. This needed clarification; the stub code generation for
mlockall() was disabled, which would prevent applications from
linking to this API (suggested by mux)
- Giant has been quoshed. It is no longer held by the code, as
the required locking has been pushed down within vm_map.c.
- Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
to express their intention explicitly.
- Inspected at the vmstat, top and vm pager sysctl stats level.
Paging-in activity is occurring correctly, using a test harness.
- The RES size for a process may appear to be greater than its SIZE.
This is believed to be due to mappings of the same shared library
page being wired twice. Further exploration is needed.
- Believed to back out of allocations and locks correctly
(tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).

PR: kern/43426, standards/54223
Reviewed by: jake, alc
Approved by: jake (mentor)
MFC after: 2 weeks


# 117224 04-Jul-2003 phk

Remove unnecessary cast.


# 116678 22-Jun-2003 phk

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# 116653 21-Jun-2003 phk

Use a do {...} while (0); and a couple of breaks to reduce the level
of indentation a bit.


# 116226 11-Jun-2003 obrien

Use __FBSDID().


# 116080 09-Jun-2003 alc

Hold the vm object's lock when performing vm_page_lookup().


# 113639 17-Apr-2003 jhb

suser() does not need the proc lock, just the setting of P_PROTECTED in
p_flag needs the lock.


# 112881 31-Mar-2003 wes

Add a facility allowing processes to inform the VM subsystem they are
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.

Reviewed by: arch@
Sponsored by: St. Bernard Software


# 112835 29-Mar-2003 mux

The object type can't be OBJT_PHYS in vm_mmap().

Reviewed by: peter


# 109153 12-Jan-2003 dillon

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# 109123 11-Jan-2003 dillon

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# 107370 28-Nov-2002 alc

Lock page field accesses in mincore().

Approved by: re (blanket)


# 105718 22-Oct-2002 rwatson

Invoke mac_check_vnode_mmap() during mmap operations on vnodes,
permitting policies to restrict access to memory mapping based on
the credential requesting the mapping, the target vnode, the
requested rights, or other policy considerations.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 103767 21-Sep-2002 jake

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# 99509 06-Jul-2002 jeff

- Hold a lock on the vnode acquired from the file table across the call to
vm_mmap() as well as the GETATTR etc.
- If the handle is a vnode in vm_mmap() assert that it is locked.
- Wiggle Giant around a little to account for the extra vnode operation.


# 98833 25-Jun-2002 dillon

Part I of RLIMIT_VMEM implementation. Implement core functionality for
a new resource limit that covers a process's entire VM space, including
mmap()'d space.

(Part II will be additional code to check RLIMIT_VMEM during exec() but it
needs more fleshing out).

PR: kern/18209
Submitted by: Andrey Alekseyev <uitm@zenon.net>, Dmitry Kim <jason@nichego.net>
MFC after: 7 days


# 98656 22-Jun-2002 alc

o Remove the unnecessary acquisition and release of Giant around fdrop()
in mmap(2).


# 98632 22-Jun-2002 alc

o Reduce the scope of Giant in vm_mmap() to just the code that manipulates
a vnode. (Thus, MAP_ANON and MAP_STACK never acquire Giant.)


# 98304 16-Jun-2002 alc

o Remove GIANT_REQUIRED from vm_fault_user_wire().
o Move pmap_pageable() outside of Giant in vm_fault_unwire().
(pmap_pageable() is a no-op on all supported architectures.)
o Remove the acquisition and release of Giant from mlock().


# 98240 15-Jun-2002 alc

o Remove the acquisition and release of Giant from munlock().

Reviewed by: tegge


# 98226 14-Jun-2002 alc

o Use vm_map_wire() and vm_map_unwire() in place of vm_map_pageable() and
vm_map_user_pageable().
o Remove vm_map_pageable() and vm_map_user_pageable().
o Remove vm_map_clear_recursive() and vm_map_set_recursive(). (They were
only used by vm_map_pageable() and vm_map_user_pageable().)

Reviewed by: tegge


# 97947 06-Jun-2002 alfred

fix typo in _SYS_SYSPROTO_H_ case: s/mlockall_args/munlockall_args

Submitted by: Mark Santcroos <marks@ripe.net>


# 97556 30-May-2002 alfred

Check for defined(__i386__) instead of just defined(i386) since the compiler
will be updated to only define(__i386__) for ANSI cleanliness.


# 97294 26-May-2002 alc

o Acquire and release Giant around pmap operations in vm_fault_unwire()
and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete()
only if operating on the kernel_object or the kmem_object.
o Remove GIANT_REQUIRED from vm_map_remove().
o Remove the acquisition and release of Giant from munmap().


# 96875 18-May-2002 alc

o Eliminate the acquisition and release of Giant from minherit(2).
(vm_map_inherit() no longer requires Giant to be held.)


# 96839 18-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and
release Giant around vm_map_madvise()'s call to pmap_object_init_pt().
o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition
and release of Giant.
o Remove the acquisition and release of Giant from madvise().


# 96832 18-May-2002 alc

o Remove the acquisition and release of Giant from mprotect().


# 96007 04-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_lookup_entry() and
vm_map_check_protection().
o Call vm_map_check_protection() without Giant held in munmap().


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 92727 19-Mar-2002 alfred

Remove __P.


# 92029 10-Mar-2002 eivind

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 90702 15-Feb-2002 bde

Garbage-collect options ACPI_NO_ENABLE_ON_BOOT, AML_DEBUG, BLEED,
DEVICE_SYSCTLS, KEY, LOUTB, NFS_MUIDHASHSIZ, NFS_UIDHASHSIZ, PCI_QUIET
and SIMPLELOCK_DEBUG.


# 89319 13-Jan-2002 alfred

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


# 89306 13-Jan-2002 alfred

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 84783 10-Oct-2001 ps

Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by: peter
MFC after: 2 weeks


# 83986 26-Sep-2001 rwatson

o Modify access control checks in mmap() to use securelevel_gt() instead
of direct variable access.

Obtained from: TrustedBSD Project


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82612 30-Aug-2001 dillon

Cleanup


# 82290 24-Aug-2001 dillon

Remove support for the badly broken MAP_INHERIT (from -current only).


# 79242 04-Jul-2001 dillon

whitespace / register cleanup


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 77139 24-May-2001 jhb

Stick VM syscalls back under Giant if the BLEED option is not defined.


# 77083 23-May-2001 jhb

- Obtain Giant in mmap() syscall while messing with file descriptors and
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
pager routines that need Giant or into other VM routines that need
Giant.
- Replace code that used a goto to jump around the else branch of a test
to use an else branch instead.


# 76974 22-May-2001 jhb

Unlock the VM lock at the end of munlock() instead of locking it again.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 65770 12-Sep-2000 bp

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 63897 26-Jul-2000 mckusick

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


# 62067 25-Jun-2000 markm

Nifty idea from Jeroen van Gelderen; don't call a routine to check if
we are using the /dev/zero device, just check a flag (supplied by
/dev/zero).
Reviewed by: dfr


# 60757 21-May-2000 peter

Checkpoint of a new physical memory backed object type, that does not
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.

This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.

Reviewed by: dillon


# 60755 21-May-2000 peter

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 59496 22-Apr-2000 wollman

Implement POSIX.1b shared memory objects. In this implementation,
shared memory objects are regular files; the shm_open(3) routine
uses fcntl(2) to set a flag on the descriptor which tells mmap(2)
to automatically apply MAP_NOSYNC.

Not objected to by: bde, dillon, dufault, jasone


# 58705 27-Mar-2000 charnier

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 58634 26-Mar-2000 charnier

Spelling


# 57550 28-Feb-2000 ps

Add MAP_NOCORE to mmap(2), and MADV_NOCORE and MADV_CORE to madvise(2).
This
This feature allows you to specify if mmap'd data is included in
an application's corefile.

Change the type of eflags in struct vm_map_entry from u_char to
vm_eflags_t (an unsigned int).

Reviewed by: dillon,jdp,alfred
Approved by: jkh


# 57263 16-Feb-2000 dillon

Fix null-pointer dereference crash when the system is intentionally
run out of KVM through a mmap()/fork() bomb that allocates hundreds
of thousands of vm_map_entry structures.

Add panic to make null-pointer dereference crash a little more verbose.

Add a new sysctl, vm.max_proc_mmap, which specifies the maximum number
of mmap()'d spaces (discrete vm_map_entry's in the process). The value
defaults to around 9000 for a 128MB machine. The test is scaled for the
number of processes sharing a vmspace (aka linux threads). Setting
the value to 0 disables the feature.

PR: kern/16573
Approved by: jkh


# 55351 03-Jan-2000 guido

Use MAP_NOSYNC for vnodes without any links in their filesystem.

This is necessary for vmware: it does not use an anonymous mmap for
the memory of the virtual system. In stead it creates a temp file an
unlinks it. For a 50 MB file, this results in a ot of syncing
every 30 seconds.

Reviewed by: Matthew Dillon <dillon@backplane.com>


# 54467 12-Dec-1999 dillon

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 51493 21-Sep-1999 dillon

cleanup madvise code, add a few more sanity checks.

Reviewed by: Alan Cox <alc@cs.rice.edu>, dg@root.com


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 47765 05-Jun-1999 alc

vm_mmap:
Insure that device mappings get MAP_PREFAULT(_PARTIAL) set,
so that 4M page mappings are used when possible.

Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>


# 47258 16-May-1999 alc

Add the options MAP_PREFAULT and MAP_PREFAULT_PARTIAL to vm_map_find/insert,
eliminating the need for the pmap_object_init_pt calls in imgact_* and
mmap.

Reviewed by: David Greenman <dg@root.com>


# 47243 16-May-1999 alc

Remove prototypes for functions that don't exist anymore (vm_map.h).

Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).

Remove a redundant test in vm_uiomove (vm_map.c).

Make two changes to vm_object_coalesce:

1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)

2. Free any swap space allocated to removed pages.


# 47207 14-May-1999 alc

Simplify vm_map_find/insert's interface: remove the MAP_COPY_NEEDED option.

It never makes sense to specify MAP_COPY_NEEDED without also specifying
MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.

Reviewed by: David Greenman <dg@root.com>


# 46592 06-May-1999 peter

Add brackets to silence egcs and help clarity.


# 46538 05-May-1999 luoqi

Don't ignore mmap() address hint below the text section.


# 46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


# 45821 19-Apr-1999 peter

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# 44438 02-Mar-1999 alc

To avoid a conflict for the vm_map's lock with vm_fault, release
the read lock around the subyte operations in mincore. After the lock is
reacquired, use the map's timestamp to determine if we need to restart
the scan.


# 44379 01-Mar-1999 alc

mincore doesn't modify the vm_map. Therefore, it doesn't require
an exclusive lock. A read lock will suffice.


# 44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 43748 07-Feb-1999 dillon

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 43209 26-Jan-1999 julian

Mostly remove the VM_STACK OPTION.
This changes the definitions of a few items so that structures are the
same whether or not the option itself is enabled. This allows
people to enable and disable the option without recompilng the world.

As the author says:

|I ran into a problem pulling out the VM_STACK option. I was aware of this
|when I first did the work, but then forgot about it. The VM_STACK stuff
|has some code changes in the i386 branch. There need to be corresponding
|changes in the alpha branch before it can come out completely.

what is done:
|
|1) Pull the VM_STACK option out of the header files it appears in. This
|really shouldn't affect anything that executes with or without the rest
|of the VM_STACK patches. The vm_map_entry will then always have one
|extra element (avail_ssize). It just won't be used if the VM_STACK
|option is not turned on.
|
|I've also pulled the option out of vm_map.c. This shouldn't harm anything,
|since the routines that are enabled as a result are not called unless
|the VM_STACK option is enabled elsewhere.
|
|2) Add what appears to be appropriate code the the alpha branch, still
|protected behind the VM_STACK switch. I don't have an alpha machine,
|so we would need to get some testers with alpha machines to try it out.
|
|Once there is some testing, we can consider making the change permanent
|for both i386 and alpha.
|
[..]
|
|Once the alpha code is adequately tested, we can pull VM_STACK out
|everywhere.
|

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 42360 06-Jan-1999 julian

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# 41620 09-Dec-1998 dt

Don't disable mmap with large file offset.


# 40286 13-Oct-1998 dg

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 38799 04-Sep-1998 dfr

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 38517 24-Aug-1998 dfr

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 37649 15-Jul-1998 bde

Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.


# 37395 05-Jul-1998 dfr

Don't truncate the return value of mmap to sizeof(int).


# 37101 21-Jun-1998 bde

Removed unused includes.


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36177 19-May-1998 peter

Make the previous commit compile..


# 36164 18-May-1998 guido

Plug hole reported on Bugtraq: do not allow mmap with WRITE privs for
append-only and immutable files.

Obtained from: OpenBSD (partly)


# 34525 12-Mar-1998 guido

Fix for mmap of char devices bug as described in OpenBSD advisory of
1998/02/20
Reviewed by: John Dyson
Submitted by: "Cy Schubert" <cschuber@uumail.gov.bc.ca>


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32132 31-Dec-1997 alex

caddr_t --> void *


# 31778 16-Dec-1997 eivind

Make COMPAT_43 and COMPAT_SUNOS new-style options.


# 30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# 28992 01-Sep-1997 bde

Removed unused #includes.


# 28940 30-Aug-1997 peter

Allow non-page aligned file offset mmap's, providing that the system is
allowed to choose the address, or that the MAP_FIXED address has the same
remainder when modulo PAGE_SIZE as the file offset. Apparently this is
posix1003.1b specified behavior. SVR4 and the other *BSD's allow it too.
It costs us nothing to support and means we don't get EINVAL on some mmap
code that works perfectly elsewhere.

Obtained from: NetBSD


# 28751 25-Aug-1997 bde

Fixed type mismatches for functions with args of type vm_prot_t and/or
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.


# 27464 17-Jul-1997 dyson

Add support for 4MB pages. This includes the .text, .data, .data parts
of the kernel, and also most of the dynamic parts of the kernel. Additionally,
4MB pages will be allocated for display buffers as appropriate (only.)

The 4MB support for SMP isn't complete, but doesn't interfere with operation
either.


# 26668 15-Jun-1997 dyson

Correct the return code for the mlock system call. Also add the stubs
for mlockall and munlockall.


# 24131 23-Mar-1997 bde

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21754 16-Jan-1997 dyson

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21529 11-Jan-1997 dyson

Prepare better for multi-platform by eliminating another required
pmap routine (pmap_is_referenced.) Upper level recoded to use
pmap_ts_referenced.


# 21039 30-Dec-1996 dyson

Let the VM system know that on certain arch's that VM_PROT_READ
also implies VM_PROT_EXEC. We support it that way for now,
since the break system call by default gives VM_PROT_ALL. Now
we have a better chance of coalesing map entries when mixing
mmap/break type operations. This was contributing to excessive
numbers of map entries on the modula-3 runtime system. The
problem is still not "solved", but the situation makes more
sense.

Eventually, when we work on architectures where VM_PROT_READ
is orthogonal to VM_PROT_EXEC, we will have to visit this
issue carefully (esp. regarding security issues.)


# 20991 28-Dec-1996 dyson

The code unnecessarily created an object with no handle up-front, which
has the negative effect of disabling some map optimizations. This
patch defers the creation of the object until it needs to be at fault time.
Submitted by: Alan Cox <alc@cs.rice.edu>


# 20821 22-Dec-1996 joerg

Make DFLDSIZ and MAXDSIZ fully-supported options.

"Don't forget to do a ``make depend''" :-)


# 20449 14-Dec-1996 dyson

Implement closer-to POSIX mlock semantics. The major difference is
that we do allow mlock to span unallocated regions (of course, not
mlocking them.) We also allow mlocking of RO regions (which the old
code couldn't.) The restriction there is that once a RO region is
wired (mlocked), it cannot be debugged (or EVER written to.)

Under normal usage, the new mlock code will be a significant improvement
over our old stuff.


# 19259 29-Oct-1996 dyson

Change mmap to use OBJT_DEFAULT instead of OBJT_SWAP by default
for anonymous objects. The system will automatically change the
type to SWAP if needed (for size or pageout reasons.)


# 19142 24-Oct-1996 dyson

Remove a bogus optimization in the mmap code. It is superfluous,
and at best is the same speed as the unoptimized code. At worst, it
slows down trivial programs.


# 18908 13-Oct-1996 phk

Remove a stale comment.


# 18389 19-Sep-1996 dg

Fixed bug with reversed trunc/round_page() in madvise...start must be
trunced, end must be rounded.


# 17334 30-Jul-1996 dyson

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 17313 28-Jul-1996 dg

Slight performance tweak for previous commit.


# 17301 27-Jul-1996 dyson

Allow sequentially created mmap'ed anonymous regions to coalesce. There
is little or no reason to create a swap pager for small mmap's. The
vm_map_insert code will automatically create a swap pager if the object
becomes too large. This fix, per a request from phk.


# 17297 27-Jul-1996 dyson

Remove experimental header file. My test-build must have picked it
up in an unexpected place.
Submitted by: jkh


# 17294 27-Jul-1996 dyson

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 16026 30-May-1996 dyson

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# 15819 19-May-1996 dyson

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.


# 15809 18-May-1996 dyson

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 15583 03-May-1996 phk

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


# 14638 16-Mar-1996 dg

Force device mappings to always be shared. It doesn't make sense for them
to ever be COW and we need the mappings to be shared for backward
compatibilty.

Reviewed by: dyson


# 14574 12-Mar-1996 dyson

Allow mmap'ed devices to work correctly across forks. The sanest
solution appeared to be to allow the child to maintain the same mapping as
the parent.


# 14325 02-Mar-1996 peter

Oops.. I nearly forgot the actual core of the length/rounding/etc fixes
that Bruce asked for.

These still are not quite perfect, and in particular, it can get
upset on extreme boundary cases (addr = 0xfff, len = 0xffffffff,
which would end up mapping a single page rather than failing), but
this is better code that I committed before.

(note, the VM system does not (apparently) support single mmap segment
sizes above 0x80000000 anyway)


# 14316 02-Mar-1996 dyson

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# 14221 23-Feb-1996 peter

kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in it's own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 12904 17-Dec-1995 bde

Fixed 1TB filesize changes. Some pindexes had bogus names and types
but worked because vm_pindex_t is indistinuishable from vm_offset_t.


# 12808 13-Dec-1995 dyson

There was a bug that the size for an msync'ed region was not rounded
up. The effect of this was that msync with a size would generally sync
1 page less than it should. This problem was brought to my attention
by Darrel Herbst <dherbst@gradin.cis.upenn.edu> and Ron Minnich
<rminnich@sarnoff.com>.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12591 03-Dec-1995 bde

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# 12221 12-Nov-1995 bde

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# 11705 23-Oct-1995 dyson

First phase of removing the PG_COPYONWRITE flag, and an architectural
cleanup of mapping files.


# 11621 21-Oct-1995 dyson

Implement mincore system call.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 9456 09-Jul-1995 dg

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8585 18-May-1995 dg

Accessing pages beyond the end of a mapped file results in internal
inconsistencies in the VM system that eventually lead to a panic. These
changes fix the behavior to conform to the behavior in SunOS, which is
to deny faults to pages beyond the EOF (returning SIGBUS). Internally,
this is implemented by requiring faults to be within the object size
boundaries. These changes exposed another bug, namely that passing in
an offset to mmap when trying to map an unnamed anonymous region also
results in internal inconsistencies. In this case, the offset is forced
to zero.

Reviewed by: John Dyson and others


# 7883 16-Apr-1995 dg

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# 7366 25-Mar-1995 dg

Fix logic bug I just introduced with the flags to msync().


# 7364 25-Mar-1995 dg

Disallow both MS_ASYNC and MS_INVALIDATE flags being set at the same time
in msync().


# 7360 25-Mar-1995 dg

Added "flags" argument to msync, and implemented MS_ASYNC and MS_INVALIDATE.
The MS_ASYNC flag doesn't current work, and MS_INVALIDATE will only toss out
the pages in the address space (not all pages in the shadow chain).


# 7239 22-Mar-1995 dg

Fixed bug in vm_mmap() where the object that is created in some cases
was the wrong size. This is the likely cause of panics reported by
Lars Fredriksen and Paul Richards related to a -1 blkno when paging
via the swap_pager.

Submitted by: John Dyson


# 7215 21-Mar-1995 dg

Disallow non page-aligned file offsets in vm_mmap(). We don't support this
in either the high or low level parts of the VM system. Just return EINVAL
in this case, just like SunOS does.


# 7209 21-Mar-1995 dg

Fixed bug in the size == 0 case of msync() caused by a bogus return value
check..


# 7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 7017 12-Mar-1995 dg

Fixed obsolete comment.


# 6944 07-Mar-1995 dg

Fixed object reference count problem that occurred in the MAP_PRIVATE
case after we rewrote vm_mmap(). Added some comments to make it easier
to follow the reference counts.


# 6617 22-Feb-1995 dg

Rewrote MAP_PRIVATE case of vm_mmap() - all of the COW portion of this
routine was highly convoluted.

Submitted by: John Dyson


# 6585 20-Feb-1995 dg

Deprecated remaining use of vm_deallocate. Deprecated vm_allocate_with_
pager(). Almost completely rewrote vm_mmap(); when John gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# 6435 15-Feb-1995 dg

Don't bother calling pmap_create() when creating the temporary map.
The whole COW section of vm_mmap() should be rewritten; the current
implementation is highly convoluted.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 3449 08-Oct-1994 phk

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 2462 02-Sep-1994 dg

Whoops, accidently left out some pieces of the munmapfd patch.


# 1885 06-Aug-1994 dg

Enabled page table preloading of cached objects.

Submitted by: John Dyson


# 1827 04-Aug-1994 dg

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources