History log of /freebsd-current/sys/vm/vm_object.c
Revision Date Author Comments
# 38f5f2a4 12-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

sysctl vm.objects/vm.swap_objects: do not fill vnode info if jailed

Reported by: Shawn Webb via markj
Reviewed by: jhb, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 69748e62 12-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

vm/vm_object.c: minor cleanup

Remove sys/cdefs.h and sys/socket.h includes.
Order sys/ includes alphabetically.
Do not check for NULL before free().

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
DIfferential revision: https://reviews.freebsd.org/D43444


# 6f3e9bac 15-Dec-2023 Pawel Jakub Dawidek <pjd@FreeBSD.org>

vm: Plug umtx shm object leak.

Reviewed by: kib
Approved by: oshogbo
MFC after: 1 week
Sponsored by: Fudo Security
Differential Revision: https://reviews.freebsd.org/D43073


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 3e04ae43 14-Jul-2023 Doug Moore <dougm@FreeBSD.org>

vm_radix_init: use initializer

Several vm_radix tries are not initialized with vm_radix_init. That
works, for now, since static initialization zeroes the root field
anyway, but if initialization changes, these tries will fail. Add
missing initializer calls.

Reviewed by: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D40971


# c3821149 15-Feb-2023 Ed Maste <emaste@FreeBSD.org>

Drop space in "vm object" lock name to improve wchan

Lock names are shown in top as a `*` followed by the first five
characters of the name. `*vmobj` a little more obvious and easier to
search for than `*vm ob`.

Differential Revision: https://reviews.freebsd.org/D36264


# 6189672e 18-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

Handle ERELOOKUP from VOP_FSYNC() in several other places

We need to repeat the operation if the vnode was relocked.

Reported and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38114


# 70e1b112 19-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

vm_object.c: minor style

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# b050ee6c 16-Jan-2023 Mark Johnston <markj@FreeBSD.org>

vm_object: Fix a kernel memory disclosure via the vm_object list sysctl

Reported by: Chris J-D <chris@accessvector.net>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 03d6764b 11-Nov-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: don't limit pindex output in 'show vmopag'

This command already prints a tremendous amount of output, and properly
obeys the pager. It no longer makes sense to arbitrarily limit the pages
that are printed, as the reader will not be aware that this has
happened.

Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D37361


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 7f3c78fb 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_pager: Remove references to KVME_TYPE_DEFAULT in the kernel

Keep the definition around since it's used by userspace.

Reviewed by: alc, imp, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35791


# fff19e0e 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Remove redundant OBJ_SWAP checks

With the removal of OBJT_DEFAULT, OBJ_ANON implies OBJ_SWAP.

Note, this means that vm_object_split() is more expensive than it used
to be, as it holds busy locks until the end of the range is reached,
even if the object has no swap blocks allocated.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35789


# 0cb2610e 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm: Remove handling for OBJT_DEFAULT objects

Now that OBJT_DEFAULT objects can't be instantiated, we can simplify
checks of the form object->type == OBJT_DEFAULT || (object->flags &
OBJ_SWAP) != 0. No functional change intended.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35788


# fffc1c59 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Release object swap charge in the swap pager destructor

With the removal of OBJT_DEFAULT, we can simply handle this in
swap_pager_dealloc(). No functional change intended.

Suggested by: alc
Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35787


# 5d32157d 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Modify vm_object_allocate_anon() to return OBJT_SWAP objects

With this change, OBJT_DEFAULT objects are no longer allocated.
Instead, anonymous objects are always of type OBJT_SWAP and always have
OBJ_SWAP set.

Modify the page fault handler to check the swap block radix tree in
places where it checked for objects of type OBJT_DEFAULT. In
particular, there's no need to invoke getpages for an OBJT_SWAP object
with no swap blocks assigned.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35785


# e1979b45 12-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Assert that overcommit charge is released in the object dtor

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35780


# 630f633f 14-Jun-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Use the vm_object_(set|clear)_flag() helpers

... rather than setting and clearing flags inline. No functional change
intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35469


# 4cf9f5d8 17-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object: restore handling of shadow_count for all type of objects

instead of only OBJ_ANON objects that are backing, as it is now.
This is required for e.g. vm_meter is_object_active() detection, and
should be useful in some more cases.

Use refcount KPI for all objects, regardless of owning the object lock,
and the fact that currently OBJ_ANON cannot change for the live object.

Noted and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33549


# 18048b6e 17-Dec-2021 Stephen J. Kiernan <stevek@FreeBSD.org>

Eliminate key press requirement "show vmopag" command output.

Summary:
One was required to press a key to continue after every 18 lines of
output. This requirement had been in the "show vmopag" command since it
was introduced, which was many years before paging was added to DDB.
With paging, this explict key check is no longer necessary.

Obtained from: Juniper Networks, Inc.
MFC after: 1 week

Test Plan:
Run "show vmopag" from db> prompt and see that it does not need additional
keypresses other than the ones needed for the pager.

Subscribers: imp, #contributor_reviews_base

Differential Revision: https://reviews.freebsd.org/D33550


# cd37afd8 19-Dec-2021 Rick Macklem <rmacklem@FreeBSD.org>

vm_object: Make is_object_active() global

Commit 867c27c23a5c modified the NFS client so that
it does IO_APPEND writes directly to the NFS server,
bypassing the buffer cache. However, this could result
in stale data in client pages when the file is mmap(2)'d.
As such, the NFS client needs to call is_object_active()
to check if the file is mmap(2)'d.

This patch renames is_object_active() to vm_object_is_active(),
moves it to sys/vm/vm_object.c and makes it global, so that
the NFS client can call it in a future commit.

Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33520


# d28af1ab 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm: Add a mode to vm_object_page_remove() which skips invalid pages

This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().

Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.

PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931


# 87b64663 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Consolidate page busy sleep mechanisms

- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
instead.
- Remove vm_page_sleep_if_(x)busy().

No functional change intended.

Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32947


# 350fc36b 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

sysctl vm.objects: yield if hog

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163


# 7738118e 13-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

vm.objects_swap: disable reporting some information

For making the call faster, do not count active/inactive object queues,
and do not report vnode info if any (for tmpfs).

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163


# 42812ccc 13-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

Add vm.swap_objects sysctl

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163


# 1b610624 13-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object_list: split sysctl handler in separate function

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163


# 28bc23ab 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

tmpfs: dynamically register tmpfs pager

Remove OBJT_SWAP_TMPFS. Move tmpfs-specific swap pager bits into
tmpfs_subr.c.

There is no longer any code to directly support tmpfs in sys/vm, most
tmpfs knowledge is shared by non-anon swap object type implementation.
The tmpfs-specific methods are provided by registered tmpfs pager, which
inherits from the swap pager.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# b730fd30 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vm: Add KPI to dynamically register pagers

Pager is allowed to inherit part of its implementation from the existing
pager, which is done by copying non-NULL virtual method slots.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 7079449b 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

sys/vm: remove several other uses of OBJT_SWAP_TMPFS

Mostly in cases where OBJ_SWAP flag works as well, or by reversing the
condition so that object types can be listed.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 3e7a11ca 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object_set_memattr(): handle all object types without listing them explicitly

This avoids the need to know all existing object types in advance, by the
cost of loosing the assert that unknown object type is handled in a sane
manner.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 00a3fe96 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object_kvme_type(): reimplement by embedding kvme_type into pagerops

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 4b8365d7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add OBJT_SWAP_TMPFS pager

This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# a7c198a2 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Implement vm_object_vnode() using vm_pager_getvp()

Allow vp_heldp argument to be NULL, in which case the returned vnode
is not held for tmpfs swap objects.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 1390a5cb 01-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_freespace method

Makes the code in vm_object collapse/page_remove cleaner

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# c23c555b 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_mightbedirty method

Used to implement vm_object_mightbedirty()

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 180bcaa4 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_pager: add pgo_set_writeable_dirty method

specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# ecfbddf0 14-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

sysctl vm.objects: report backing object and swap use

For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object. This allows userspace
to deconstruct the shadow chain. Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field. I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29771


# a720b31c 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Allow consumer to customize physical pager.

Add support for user-supplied callbacks into phys pager operations,
providing custom getpages(), haspage(), and populate() methods
implementations. Pager stores user data ptr/val in the object to
provide context.

Add phys_pager_allocate() helper that takes user ops table as one of
the arguments.

Current code for these methods is moved to the 'default' ops table,
assigned automatically when vm_pager_alloc() is used.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# aec9e7d8 07-Sep-2020 Mark Johnston <markj@FreeBSD.org>

vm_object_split(): Handle orig_object type changes.

orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while
vm_object_split() is sleeping. In this case some pages in new_object
may be left unbusied, but vm_object_split() attempts to unbusy all of
them.

Track the beginning of the busied range. Add an assertion to verify
that pages are not re-added to the source object while sleeping.

Reported by: Olympios Petrakis <olympios.petrakis@netapp.com>
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26223


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# ffae7ea9 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

vm_object: allow paging_in_progress to be acquired after object termination.

The vm objects are type-stable, and can be accessed even after the
last reference is dropped, or in case of vnode objects, after vgone()
destroyed it as well.

Stop asserting that pip == 0 after vm_object_terminate() waited for
existing owners to drop it, we only want to drain them before setting
OBJ_DEAD flag. Also stop asserting pip == 0 in object destructor.

Update comments explaining the interaction between paging_in_progress
and termination.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968


# 84242cf6 25-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Call swap_pager_freespace() from vm_object_page_remove().

All vm_object_page_remove() callers, except
linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when
removing a range of pages from an object. The LinuxKPI case appears to
be an unintentional omission that could result in leaked swap blocks, so
unconditionally free swap space in vm_object_page_remove() to protect
against similar bugs in the future.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25329


# cdd02f43 19-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Revert r362360.

This commit was simply wrong since two different objects are locked.

Reported by: lwhsu, pho
Pointy hat: markj


# 61b00688 18-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Fix a double object unlock in vm_object_backing_collapse_wait().

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25327


# 8cc8c586 12-Jun-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

Honor db_pager_quit in some vm_object ddb commands

These can be rather verbose.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon


# d869a17e 06-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23978


# 1a0c234e 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Simplify vref() code in object_reference. The local temporary is no longer
necessary. Fix formatting errors.

Reported by: mjg
Discussed with: kib


# c99d0c58 28-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Add a blocking counter KPI.

refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter. However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by: jeff, kib, mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23723


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# eaa17d42 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

sys/vm: quiet -Wwrite-strings

Discussed with: kib
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23796


# 8d34a3bf 04-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Enable vm_object_mightbedirty() and vm_object_page_clean() for swap
objects backing tmpfs vnodes data.

The clean scan is limited to only remove write permissions from the
mapped pages of the objects. This fixes the issue that tmpfs vnode
mtime is not updated from writes to the mmaped area after the initial
page-in.

Noted by: mjg
Reviewed by: markj
Discussed with: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23432


# cd0047f3 24-Jan-2020 Konstantin Belousov <kib@FreeBSD.org>

Handle a race of collapse with a retrying fault.

Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might
observe an invalid page left in the default backing object by the
fault handler that retried. Check for the condition and refuse to collapse.

Reported and tested by: pho
Reviewed by: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23331


# 98087a06 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Make collapse synchronization more explicit and allow it to complete during
paging.

Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object. This gives us an explicit test rather than overloading
paging-in-progress. While split is on-going we mark an object with SPLIT.
These two operations will modify the swap tree so they must be serialized
and swap_pager_getpages() can now directly detect these conditions and page
more conservatively.

Callers to vm_object_collapse() now will reliably wait for a collapse to finish
so that the backing chain is as short as possible before other decisions are
made that may inflate the object chain. For example, split, coalesce, etc.
It is now safe to run fault concurrently with collapse. It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.

This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.

This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908


# 58447749 16-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a long standing bug that was made worse in r355765. When we are cowing a
page that was previously mapped read-only it exists in pmap until pmap_enter()
returns. However, we held no reference to the original page after the copy
was complete. This allowed vm_object_scan_all_shadowed() to collapse an
object that still had pages mapped. To resolve this, add another page pointer
to the faultstate so we can keep the page xbusy until we're done with
pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done
in vm_object_collapse_scan().

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23155


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# 3cf3b4e6 21-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Make page busy state deterministic on free. Pages must be xbusy when
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.

Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.

Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822


# 4bf95d00 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Previously we did not support invalid pages in default objects. This means
that if fault fails to progress and needs to restart the loop it must free
the page it is working on and allocate again on restart. Resolve the few
places that need to be modified to support this condition and simply
deactivate the page. Presently, we only permit this when fault restarts
for busy contention. This has an added benefit of removing some object
trylocking in this case.

While here consolidate some page cleanup logic into fault_page_free() and
fault_page_release() to reduce redundant code and automate some teardown.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22653


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650


# d8ad7b7d 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Do not assert that the object lock is held in vm_object_set_writeable_dirty.
A valid reference is all that is required. If we race with a deallocation
we will harmlessly misidentify the type of an already dead object.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22636


# 67388836 01-Dec-2019 Konstantin Belousov <kib@FreeBSD.org>

Store the bottom of the shadow chain in OBJ_ANON object->handle member.

The handle value is stable for all shadow objects in the inheritance
chain. This allows to avoid descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminate
corresponding object relocking which appeared to be contending.

Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to set up the handle.

Reported by: jeff
Reviewed by: alc (previous version), jeff (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differrential revision: https://reviews.freebsd.org/D22541


# 26c4e983 29-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix a perf regression from r355122. We can use a shared lock to drop the
last ref on vnodes.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22565


# a67d5408 26-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Use atomics in more cases for object references. We now can completely
omit the object lock if we are above a certain threshold. Hold only a
single vnode reference when the vnode object has any ref > 0. This
allows us to only lock the object and vnode on 0-1 and 1-0 transitions.

Differential Revision: https://reviews.freebsd.org/D22452


# 6a14746c 25-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

vm_object_collapse_scan_wait: drop locks before reacquiring

Regression from r352174. In the vm_page_rename() failure case we forgot
to unlock the vm object locks before sleeping and reacquiring them.

Reviewed by: jeff
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22542


# 32362449 24-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Ignore object->handle for OBJ_ANON objects.

Note that the change in vm_object_collapse() is arguably a correctness
fix. We must not collapse into content-identity carrying objects.

Reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22467


# 51b867e5 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Only keep anonymous objects on shadow lists. This eliminates locking of
globally visible objects when they are part of a backing chain.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22423


# 63967687 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
reudundant complicated checks and additional locking required only for
anonymous memory. Introduce vm_object_allocate_anon() to create these
objects. DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by: alc, dougm, kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22119


# 67d0e293 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116


# 51df5321 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Use atomics and a shared object lock to protect the object reference count.

Certain consumers still need to guarantee a stable reference so we can not
switch entirely to atomics yet. Exclusive lock holders can still modify
and examine the refcount without using the ref api.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21598


# fff5403f 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the
object lock in vm_page_set_validclean().

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21595


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 205be21d 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592


# 8da1c098 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep().

This persists busy state across operations like rename and replace.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21549


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trival to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882


# 87e93ea6 27-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix object locking in vm_object_unwire() after r352174.

Now, vm_page_busy_sleep() expects the page's object to be locked.
vm_object_unwire() does some unusual lazy locking of the object chain
and keeps objects locked until a busy page is encountered or the loop
terminates. When a busy page is encountered, rather than unlocking all
but the "bottom-level" object, we must instead skip the object to which
"tm" belongs.

Reported and tested by: pho
Reviewed by: kib
Discussed with: jeff
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21790


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# 11b57401 12-Sep-2019 Hans Petter Selasky <hselasky@FreeBSD.org>

Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision: https://reviews.freebsd.org/D21620
Reviewed by: kib@, markj@
MFC after: 1 week
Sponsored by: Mellanox Technologies


# c7575748 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.

Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546


# 4cdea4a8 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchonization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486


# 0e79619e 07-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

vm_object_deallocate(): Remove no longer needed code.

We track text mappings explicitly, there is no removal of the text
refs on the object deallocate any more, so tmpfs objects should not be
treated specially. Doing so causes excess deref.

Reported and tested by: gallatin
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21560


# 4f49b435 07-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

vm_object_coalesce(): avoid extending any nosplit objects, not only
that which back tmpfs nodes.

Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21560


# 2614bd96 28-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vm: only lock tmpfs vnode shared in vm_object_deallocate

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21455


# 783a68aa 25-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Move OBJT_VNODE specific code from vm_object_terminate() to
vnode_destroy_vobject().

Reviewed by: alc, jeff (previous version), markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21357


# cf27e0d1 19-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by: markj
Tested by: pho (as part of a larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21311


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247


# 0fd977b3 26-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a return value to vm_page_remove().

Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758


# d842aa51 01-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a vm_page_wired() predicate.

Use it instead of accessing the wire_count field directly. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485


# 12487941 06-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Noted by: alc
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 6 days


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 7f144605 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Do not collapse objects with OBJ_NOSPLIT backing swap object.

NOSPLIT swap objects are not anonymous, they are used by tmpfs regular
files and POSIX shared memory. For such objects, collapse is not
permitted.

Reported by: mjg
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923


# 5e38e3f5 29-Nov-2018 Eric van Gyzen <vangyzen@FreeBSD.org>

Include path for tmpfs objects in vm.objects sysctl

This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by: kib, dab
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D2724


# 0951bd36 29-Nov-2018 Eric van Gyzen <vangyzen@FreeBSD.org>

Add assertions and comment to vm_object_vnode()

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D2724


# 0e48e068 10-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Re-apply r336984, reverting r339934.

r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down. Now that the bug is fixed, we
can revert the revert.

Discussed with: alc
MFC after: 3 days


# 5d277a85 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Revert r336984.

It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.

Requested by: alc
MFC after: now


# 4c29d2de 23-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Refactor domainset iterators for use by malloc(9) and UMA.

Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc". The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy. Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations. To that end, refactor the vm_dominset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by: jeff (previous version)
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17417


# 005783a0 31-Jul-2018 Alan Cox <alc@FreeBSD.org>

Allow vm object coalescing to occur in the midst of a vm object when the
OBJ_ONEMAPPING flag is set. In other words, allow recycling of existing
but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set.

Such situations are increasingly common with jemalloc >= 5.0. This
change has the expected effect of reducing the number of vm map entry and
object allocations and increasing the number of superpage promotions.

Reviewed by: kib, markj
Tested by: pho
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16501


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 20e4afbf 04-Feb-2018 Konstantin Belousov <kib@FreeBSD.org>

On munlock(), unwire correct page.

It is possible, for complex fork()/collapse situations, to have
sibling address spaces to partially share shadow chains. If one
sibling performs wiring, it can happen that a transient page, invalid
and busy, is installed into a shadow object which is visible to other
sibling for the duration of vm_fault_hold(). When the backing object
contains the valid page, and the wiring is performed on read-only
entry, the transient page is eventually removed.

But the sibling which observed the transient page might perform the
unwire, executing vm_object_unwire(). There, the first page found in
the shadow chain is considered as the page that was wired for the
mapping. It is really the page below it which is wired. So we unwire
the wrong page, either triggering the asserts of breaking the page'
wire counter.

As the fix, wait for the busy state to finish if we find such page
during unwire, and restart the shadow chain walk after the sleep.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14184


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 11542376 25-Dec-2017 Alan Cox <alc@FreeBSD.org>

Make the vm object bypass and collapse counters per CPU.

Requested by: mjg
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13611


# 796df753 30-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

SPDX: Consider code from Carnegie-Mellon University.

Interesting cases, most likely from CMU Mach sources.


# 2e47807c 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.

The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA. When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA. On 32bit architectures this should behave much more
gracefully as we exhaust KVA. On 64bit the limits are likely never hit.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 4074d642 18-Oct-2017 Alan Cox <alc@FreeBSD.org>

Batch atomic updates to the number of active, inactive, and laundry
pages by vm_object_terminate_pages(). For example, for a "buildworld"
workload, this batching reduces vm_object_terminate_pages()'s average
execution time by 12%. (The total savings were about 11.7 billion
processor cycles.)

Reviewed by: kib
MFC after: 1 week


# cf060942 28-Sep-2017 Alan Cox <alc@FreeBSD.org>

Optimize vm_object_page_remove() by eliminating pointless calls to
pmap_remove_all(). If the object to which a page belongs has no
references, then that page cannot possibly be mapped.

Reviewed by: kib
MFC after: 1 week


# 5bf94937 19-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

For unlinked files, do not msync(2) or sync on the vnode deactivation.

One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411


# bba52eca 15-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Batch freeing of the pages in vm_object_page_remove() under the same
free queue mutex lock owning session, same as it was done for the
object termination in r323561.

Reported and tested by: mjg
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2fcd1ff6 13-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not relock free queue mutex for each page, free whole terminating
object' page queue under the single mutex lock.

First, all pages on the queue are prepared for free by calls to
vm_page_free_prep(), and pages which should not be returned to the
physical allocator (e.g. wired or fictitious) are simply removed from
the queue. On the second pass, vm_page_free_phys_pglist() inserts all
pages from the queue without relocking the mutex.

The change improves the object termination, e.g. on the process exit
where large anonymous memory objects otherwise cause relocks the free
queue mutex for each page. More, if several such processes are
exiting or execing in parallel, the mutex was highly contended on
the address space demolition.

Diagnosed and tested by: mjg (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 93c5d3a4 09-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Add a vm_page_change_lock() helper, the common code to not relock page
lock if both old and new pages use the same underlying lock. Convert
existing places to use the helper instead of inlining it. Use the
optimization in vm_object_page_remove().

Suggested and reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f425ab8e 25-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

- The global swhash_mtx is removed, trie is synchronized by the
corresponding object lock.
- The swp_pager_meta_free_all() function used during object
termination is optimized by only looking at the trie instead of
having to search whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash,
global object list have to be inspected. There, we have to ensure
that we do see valid trie content if we see that the object type is
swap.
Sizing of the swblk zone is same as for swblock zone, each swblk maps
SWAP_META_PAGES pages.

Proposed by: alc
Reviewed by: alc, markj (previous version)
Tested by: alc, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D11435


# 7bbdb843 16-Aug-2017 Ruslan Bukin <br@FreeBSD.org>

Add OBJ_PG_DTOR flag to VM object.

Setting this flag allows us to skip pages removal from VM object queue
during object termination and to leave that for cdev_pg_dtor function.

Move pages removal code to separate function vm_object_terminate_pages()
as comments does not survive indentation.

This will be required for Intel SGX support where we will have to remove
pages from VM object manually.

Reviewed by: kib, alc
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11688


# 0ecee546 22-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not allocate struct kinfo_vmobject on stack.

Its size is 1184 bytes.

Noted by: eugen
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cd1241fb 19-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Add pctrie_init() and vm_radix_init() to initialize generic pctrie and
vm_radix trie.

Existing vm_radix_init() function is renamed to vm_radix_zinit().
Inlines moved out of the _ headers.

Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11661


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# 52d1adda 15-Mar-2017 Alan Cox <alc@FreeBSD.org>

Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by: kib, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D10011


# d1780e8d 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# aa3650ea 30-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Avoid page lookups in the top-level object in vm_object_madvise().

We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.

Suggested by: alc
Reviewed by: alc, kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9282


# c2655a40 14-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Avoid unnecessary page lookups in vm_object_madvise().

vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098


# 77d6fd97 18-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Improve vm_object_scan_all_shadowed() to also check swap backing objects.

As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index. Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.

Testing shows that number of new succesful scans, enabled by this
addition, is small but non-zero. When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 3453bca8 12-Dec-2016 Alan Cox <alc@FreeBSD.org>

Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752


# 563a19d5 01-Dec-2016 Alan Cox <alc@FreeBSD.org>

During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert(). With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8664


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# ebcddc72 09-Nov-2016 Alan Cox <alc@FreeBSD.org>

Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302


# 5975e53d 13-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix a race in vm_page_busy_sleep(9).

Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196


# 411455a8 10-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month


# 19efd8a5 11-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 9f790a17 29-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Do not leak the vm object lock when swap reservation failed, in
vm_object_coalesce().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# aa9bc3b1 26-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Prevent parallel object collapses. Both vm_object_collapse_scan() and
swap_pager_copy() might unlock the object, which allows the parallel
collapse to execute. Besides destroying the object, it also might
move the reference from parent to the backing object, firing the
assertion ref_count == 1.

Collapses are prevented by bumping paging_in_progress counters on both
the object and its backing object.

Reported by: cem
Tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D6085


# 98f139da 26-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Style changes to some most outrageous violations in vm_object_collapse().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2a339d9e 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.

Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# 1bdbd705 28-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Implement process-shared locks support for libthr.so.3, without
breaking the ABI. Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by: vangyzen (previous version)
Discussed with: deischen, emaste, jhb, rwatson,
Martin Simmons <martin@lispworks.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 4cc8daf7 03-Dec-2015 Conrad Meyer <cem@FreeBSD.org>

Pull vm_object_scan_all_shadowed out of vm_object_backing_scan

These two functions were largely unrelated, they just used the same same
loop logic to walk through a backing object's memq. Pull out the
all_shadowed test as its own function and eliminate
OBSC_TEST_ALL_SHADOWED. Rename vm_object_backing_scan to
vm_object_collapse_scan.

No functional change.

Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4335


# 99a1570a 01-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

r221714 fixed the situation when the collapse scan improperly handled
invalid (busy) page supposedly inserted by the vm_fault(), in the
OBSC_COLLAPSE_NOWAIT case. As a continuation to r221714, fix a case
when invalid page is found by the object scan in OBSC_COLLAPSE_WAIT
case as well. But, since this is waitable scan, we should wait for
the termination of the busy state and restart from the beginning of
the backing object' page queue. [*]

Do not free the shadow page swap space when the parent page is
invalid, otherwise this action potentially corrupts user data.

Combine all instances of the collapse scan sleep code fragments into
the new helper vm_object_backing_scan_wait().

Improve style compliance and comments. Change the return type of
vm_object_backing_scan() to bool.

Initial submission by: cem, https://reviews.freebsd.org/D4103 [*]
Reviewed by: alc, cem
Tested by: cem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D4146


# 3138cd36 30-Sep-2015 Mark Johnston <markj@FreeBSD.org>

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 6195b24a 25-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Revert r173708's modifications to vm_object_page_remove().

Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF. Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.

The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.

The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire. This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.

Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundand, we do not remove wired pages.

Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 093c7f39 12-Jun-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 63e4c6cd 02-Jun-2015 Eric van Gyzen <vangyzen@FreeBSD.org>

Provide vnode in memory map info for files on tmpfs

When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)


# ff87ae35 27-May-2015 John Baldwin <jhb@FreeBSD.org>

Export a list of VM objects in the system via a sysctl. The list can be
examined via 'vmstat -o'. It can be used to determine which files are
using physical pages of memory and how much each is using.

Differential Revision: https://reviews.freebsd.org/D2277
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc. (forward porting to HEAD/10)


# e735691b 08-May-2015 John Baldwin <jhb@FreeBSD.org>

Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero. This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision: https://reviews.freebsd.org/D2423
Reviewed by: alc, kib (earlier version)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc.


# 6a24058f 06-Mar-2015 Alan Cox <alc@FreeBSD.org>

Correct a typo in vm_object_backing_scan() that originated in r254141.
Specifically, change a lock acquire into a lock release.

MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division


# 777a36c5 28-Feb-2015 Alan Cox <alc@FreeBSD.org>

Use RW_NEW rather than calling bzero().


# f40cb1c6 28-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied. Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 30d57414 05-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

When the last reference on the vnode' vm object is dropped, read the
vp->v_vflag without taking vnode lock and without bypass. We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle. Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping. In this case,
the execve() code manages the VV_TEXT flag on its own already.

It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.

Reported by: glebius, scottl
Tested by: glebius, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 03462509 26-Jul-2014 Alan Cox <alc@FreeBSD.org>

When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.

Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division


# 4bace8e7 24-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Correct assertion. The shadowing object cannot be tmpfs vm object,
and tmpfs object cannot shadow. In other words, tmpfs vm object is
always at the bottom of the shadow chain.

Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f08f7dca 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs
vnode for the tmpfs node owning this object. The flag is currently
used for two purposes. First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems. Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.

The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.

Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 2309fa9b 12-Mar-2014 Konstantin Belousov <kib@FreeBSD.org>

Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.

Reported and tested by: bdrewery
MFC after: 1 week


# 14a5dc17 13-Feb-2014 Attilio Rao <attilio@FreeBSD.org>

Fix-up r254141: in the process of making a failing vm_page_rename()
a call of pager_swap_freespace() was moved around, now leading to freeing
the incorrect page because of the pindex changes after vm_page_rename().

Get back to use the correct pindex when destroying the swap space.

Sponsored by: EMC / Isilon storage division
Reported by: avg
Tested by: pho
MFC after: 7 days


# 5f3563b0 12-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix function name in KASSERT().

Submitted by: hiren


# 9ded9474 04-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not coalesce if the swap object belongs to tmpfs vnode. The
coalesce would extend the object to keep pages for the anonymous
mapping created by the process. The pages has no relations to the
tmpfs file content which could be written into the corresponding
range, causing anonymous mapping and file content aliasing and
subsequent corruption.

Another lesser problem created by coalescing is over-accounting on the
tmpfs node destruction, since the object size is substracted from the
total count of the pages owned by the tmpfs mount.

Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3aaea6ef 08-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# e946b949 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# ebf5d94e 10-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 5f518366 27-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.

Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 2051980f 09-Jun-2013 Alan Cox <alc@FreeBSD.org>

Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by: EMC / Isilon Storage Division


# dfd55c0c 04-Jun-2013 Attilio Rao <attilio@FreeBSD.org>

In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked, otherwise there is no reason to do so.
This will eliminate the necessity to busy pages most of the times.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 7560005c 30-May-2013 Konstantin Belousov <kib@FreeBSD.org>

After the object lock was dropped, the object' reference count could
change. Retest the ref_count and return from the function to not
execute the further code which assumes that ref_count == 1 if it is
not. Also, do not leak vnode lock if other thread cleared OBJ_TMPFS
flag meantime.

Reported by: bdrewery
Tested by: bdrewery, pho
Sponsored by: The FreeBSD Foundation


# 782d4a63 30-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the capitalization in the assertion message. Print the address
of the object to get useful information from optimizated kernels dump.


# 6f2af3fc 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month


# e5f299ff 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode'
v_object of non OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


# 774d251d 17-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# c9341161 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


# 198da1b2 04-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmcontention:
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.

The destructor UMA zone handler, will further check if the condition is
retained at every destruction and catch for bugs.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


# 55f33f2c 02-Mar-2013 Alan Cox <alc@FreeBSD.org>

The value held by the vm object's field pg_color is only considered
valid if the flag OBJ_COLORED is set. Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.

Sponsored by: EMC / Isilon Storage Division


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide


# 64a3476f 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Remove white spaces.

Sponsored by: EMC / Isilon storage division


# 0dde287b 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 28634820 08-Dec-2012 Alan Cox <alc@FreeBSD.org>

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by: kib


# 96b0b92a 28-Nov-2012 Alan Cox <alc@FreeBSD.org>

Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr(). Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.

Reviewed by: kib
MFC after: 10 days


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 9af47af6 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Eliminate the conditional for releasing the page queues lock in
vm_page_sleep(). vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held. These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after: 3 weeks


# 877d24ac 28-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# 5f9c767b 20-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Plug the accounting leak for the wired pages when msync(MS_INVALIDATE)
is performed on the vnode mapping which is wired in other address space.

While there, explicitely assert that the page is unwired and zero the
wire_count instead of substract. The condition is rechecked later in
vm_page_free(_toq) already.

Reported and tested by: zont
Reviewed by: alc (previous version)
MFC after: 1 week


# 571a1e92 10-Jul-2012 Attilio Rao <attilio@FreeBSD.org>

Document the object type movements, related to swp_pager_copy(),
in vm_object_collapse() and vm_object_split().

In collabouration with: alc
MFC after: 3 days


# 92a59946 19-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB. Specifically, the inlined version of 'ptoa' of the the 'int'
count of pages overflowed on 64-bit platforms. While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.


# 126d6082 17-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


# e65919f9 04-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not restart the scan in vm_object_page_clean() on the object
generation change if requested mode is async. The object generation is
only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async
mode it is enough to write each dirty page, not to make a guarantee that
all pages are cleared after the vm_object_page_clean() returned.

Diagnosed by: truckman
Tested by: flo
Reviewed by: alc, truckman
MFC after: 2 weeks


# b5f359b7 28-Dec-2011 Alan Cox <alc@FreeBSD.org>

Optimize vm_object_split()'s handling of reservations.


# 75ff604a 23-Dec-2011 Konstantin Belousov <kib@FreeBSD.org>

Optimize the common case of msyncing the whole file mapping with
MS_SYNC flag. The system must guarantee that all writes are finished
before syscalls returned. Schedule the writes in async mode, which is
much faster and allows the clustering to occur. Wait for writes using
VOP_FSYNC(), since we are syncing the whole file mapping.

Potentially, the restriction to only apply the optimization can be
relaxed by not requiring that the mapping cover whole file, as it is
done by other OSes.

Reported and tested by: az
Reviewed by: alc
MFC after: 2 weeks


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# d98d0ce2 09-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use use just
VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)


# 6bbee8e2 29-Jun-2011 Alan Cox <alc@FreeBSD.org>

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 031ec8c1 01-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


# e18cc7bf 09-May-2011 Max Laier <mlaier@FreeBSD.org>

Another long standing vm bug found at Isilon:
Fix a race between vm_object_collapse and vm_fault.

Reviewed by: alc@
MFC after: 3 days


# 86769ac0 23-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix two bugs in r218670.

Hold the vnode around the region where object lock is dropped, until
vnode lock is acquired.

Do not drop the vnode reference for a case when the object was
deallocated during unlock. Note that in this case, VV_TEXT is cleared
by vnode_pager_dealloc().

Reported and tested by: pho
Reviewed by: alc
MFC after: 3 days


# 03fa5b34 13-Feb-2011 Konstantin Belousov <kib@FreeBSD.org>

Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note
mentioning that vnode lock is needed.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# 17f3095d 05-Feb-2011 Alan Cox <alc@FreeBSD.org>

Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean(). They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by: kib
MFC after: 3 weeks


# 0cc74f14 04-Feb-2011 Alan Cox <alc@FreeBSD.org>

Since the last parameter to vm_object_shadow() is a vm_size_t and not a
vm_pindex_t, it makes no sense for its callers to perform atop(). Let
vm_object_shadow() do that instead.


# c6c9025b 15-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

For consistency, use kernel_object instead of &kernel_object_store
when initializing the object mutex. Do the same for kmem_object.

Discussed with: alc
MFC after: 1 week


# edf93b25 01-Jan-2011 Alan Cox <alc@FreeBSD.org>

Make a couple refinements to r216799 and r216810. In particular, revise
a comment and move it to its proper place.

Reviewed by: kib


# 50cfe7fa 29-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by: alc (as a part of the bigger patch)
MFC after: 1 month (after r216799 is merged)


# 3280870d 28-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where restart of the scan in vm_object_page_clean() did
not removed write permissions for newly added pages or, if the mapping
for some already scanned page changed to writeable due to fault.
Merge the two loops in vm_object_page_clean(), doing the remove of
write permission and cleaning in the same loop. The restart of the
loop then correctly downgrade writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by: alc
MFC after: 1 week


# ef694c1a 02-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.

Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation


# 780636b7 23-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

After the sleep caused by encountering a busy page, relookup the page.

Submitted and reviewed by: alc
Reprted and tested by: pho
MFC after: 5 days


# 3157c503 21-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Eliminate the mab, maf arrays and related variables.

The change also fixes off-by-one error in the calculation of mreq.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 5 days


# 17ea6f00 20-Nov-2010 Alan Cox <alc@FreeBSD.org>

Optimize vm_object_terminate().

Reviewed by: kib
MFC after: 1 week


# 4c7b9a20 20-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after: 5 days


# 1e8a675c 18-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by: avg
Reviewed by: alc
MFC after: 1 week


# 4166faae 18-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Only increment object generation count when inserting the page into
object page list. The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Page removals do not affect the dirtiness of the
object, as well as manipulations with the shadow chain.

Suggested and reviewed by: alc
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 757216f3 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Several cleanups for the r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by: alc


# e239bb97 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by: alc
Tested by: pho


# b382c10a 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 452e6e0d 23-May-2010 Alan Cox <alc@FreeBSD.org>

MFC r208159
Add a comment about the proper use of vm_object_page_remove().


# b28c6ddb 20-May-2010 Alan Cox <alc@FreeBSD.org>

MFC r207306
Change vm_object_madvise() so that it checks whether the page is invalid
or unmanaged before acquiring the page queues lock. Neither of these
tests require that lock. Moreover, a better way of testing if the page
is unmanaged is to test the type of vm object. This avoids a pointless
vm_page_lookup().


# a1a95cd6 16-May-2010 Alan Cox <alc@FreeBSD.org>

Add a comment about the proper use of vm_object_page_remove().

MFC after: 1 week


# 1f93868d 13-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC elimination of several settings of PG_REFERENCED bit, that either
do not make sense or are harmful.

MFC r206761 (by alc):
Setting PG_REFERENCED on the requested page in swap_pager_getpages() is
either redundant or harmful, depending on the caller.

MFC r206768 (by alc):
In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical.

MFC r206770 (by alc):
In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED.

MFC r206801 (by alc):
There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on.


# 3c4a2440 08-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# 70721880 06-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate acquisitions of the page queues lock that are no longer needed.

Switch to a per-processor counter for the number of pages freed during
process termination.


# eb00b276 06-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking around most calls to vm_page_free().


# 5ac59343 05-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# ac800a84 02-May-2010 Alan Cox <alc@FreeBSD.org>

Correct an error in r207410: Remove an unlock of a lock that is no longer
held.


# 7bec141b 30-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

push up dropping of the page queue lock to avoid holding it in vm_pageout_flush


# ad0c05da 30-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

don't call vm_pageout_flush with the page queue mutex held

Reported by: Michael Butler


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 6a2a3d73 27-Apr-2010 Alan Cox <alc@FreeBSD.org>

Change vm_object_madvise() so that it checks whether the page is invalid
or unmanaged before acquiring the page queues lock. Neither of these
tests require that lock. Moreover, a better way of testing if the page
is unmanaged is to test the type of vm object. This avoids a pointless
vm_page_lookup().

MFC after: 3 weeks


# 4b9dd5d5 18-Apr-2010 Alan Cox <alc@FreeBSD.org>

There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on. Eliminate it.

MFC after: 3 weeks


# b11b56b5 17-Apr-2010 Alan Cox <alc@FreeBSD.org>

In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED. In that case,
the intention is to activate the page, so discouraging the page daemon
from reclaiming the page makes sense. In contrast, in the other cases,
MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage
the page daemon from reclaiming the page by setting PG_REFERENCED.

Wrap a nearby line.

Discussed with: kib
MFC after: 3 weeks


# aefea7f5 17-Apr-2010 Alan Cox <alc@FreeBSD.org>

In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical. Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened. However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening. More importantly, this muddles the meaning of
PG_REFERENCED. There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page. There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.

Discussed with: kib
MFC after: 3 weeks


# 2d63cbda 10-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r200770:
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.


# 49e3050e 20-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 01381811 24-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 3153e878 12-Jul-2009 Alan Cox <alc@FreeBSD.org>

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# 9b4d473a 28-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Eliminiate code duplication by calling vm_object_destroy()
from vm_object_collapse().

Requested and reviewed by: alc
Approved by: re (kensmith)


# 26f4eea5 23-Jun-2009 Alan Cox <alc@FreeBSD.org>

The bits set in a page's dirty mask are a subset of the bits set in its
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.

Eliminate an unnecessary #include.


# 3364c323 23-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# 387aabc5 14-Jun-2009 Alan Cox <alc@FreeBSD.org>

Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page and always an
insufficent number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistance. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with: jhb@


# a28042d1 28-May-2009 Alan Cox <alc@FreeBSD.org>

Change vm_object_page_remove() such that it clears the page's dirty bits
when it invalidates the page.

Suggested by: tegge


# bb2ac86f 23-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not call vm_page_lookup() from the ddb routine, namely from "show
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.

Take the advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.

While there, make the code slightly more style-compliant by moving
variables declarations to the right place.

Discussed with: jhb, alc
Reviewed by: alc
MFC after: 2 weeks


# bfd9b137 21-Feb-2009 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock in vm_object_page_remove().

MFC after: 1 week


# 7b54b1a9 08-Feb-2009 Alan Cox <alc@FreeBSD.org>

Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by: kib


# e9f54126 21-Dec-2008 Robert Noland <rnoland@FreeBSD.org>

Fix printing of KASSERT message missed in r163604.

Approved by: kib


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 6bd9cb1c 03-Aug-2008 Tom Rhodes <trhodes@FreeBSD.org>

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


# 2c3b410b 30-Jul-2008 John Baldwin <jhb@FreeBSD.org>

One more whitespace nit.


# 3cca4b6f 30-Jul-2008 John Baldwin <jhb@FreeBSD.org>

A few more whitespace fixes.


# 2ac78f0e 20-May-2008 Stephan Uphoff <ups@FreeBSD.org>

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


# 52481a9a 29-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Use vm_object_reference_locked() directly from
vm_object_reference(). This is intended to get rid of vget()
consumers who don't wish to acquire a lock. This is functionally
the same as calling vref(). vm_object_reference_locked() already
uses vref.

Discussed with: alc


# 68855966 26-Feb-2008 Alan Cox <alc@FreeBSD.org>

Correct a long-standing error in vm_object_page_remove(). Specifically,
pmap_remove_all() must not be called on fictitious pages. To date,
fictitious pages have been allocated from zeroed memory, effectively
hiding this problem because the fictitious pages appear to have an empty
pv list. Submitted by: Kostik Belousov

Rewrite the comments describing vm_object_page_remove() to better
describe what it does. Add an assertion. Reviewed by: Kostik Belousov

MFC after: 1 week


# 4c8e0452 24-Feb-2008 Alan Cox <alc@FreeBSD.org>

Correct a long-standing error in vm_object_deallocate(). Specifically,
only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should
ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was
setting it on device (OBJT_DEVICE) objects. As a result,
vm_object_page_remove() could be called on a device object and if that
occurred pmap_remove_all() would be called on the device object's pages.
However, a device object's pages are fictitious, and fictitious pages do
not have an initialized pv list (struct md_page).

To date, fictitious pages have been allocated from zeroed memory,
effectively hiding this problem. Now, however, the conversion of rotting
diagnostics to invariants in the amd64 and i386 pmaps has revealed the
problem. Specifically, assertion failures have occurred during the
initialization phase of the X server on some hardware.

MFC after: 1 week
Discussed with: Kostik Belousov
Reported by: Michiel Boland


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# f8a47341 29-Dec-2007 Alan Cox <alc@FreeBSD.org>

Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)


# 59677d3c 17-Nov-2007 Alan Cox <alc@FreeBSD.org>

Prevent the leakage of wired pages in the following circumstances:
First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated.
Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the
pages beyond the EOF are unmapped and freed. However, when the file is
mlock(2)ed, the pages beyond the EOF are unmapped but not freed because
they have a non-zero wire count. This can be a mistake. Specifically,
it is a mistake if the sole reason why the pages are wired is because of
wired, managed mappings. Previously, unmapping the pages destroys these
wired, managed mappings, but does not reduce the pages' wire count.
Consequently, when the file is unmapped, the pages are not unwired
because the wired mapping has been destroyed. Moreover, when the vm
object is finally destroyed, the pages are leaked because they are still
wired. The fix is to reduce the pages' wired count by the number of
wired, managed mappings destroyed. To do this, I introduce a new pmap
function pmap_page_wired_mappings() that returns the number of managed
mappings to the given physical page that are wired, and I use this
function in vm_object_page_remove().

Reviewed by: tegge
MFC after: 6 weeks


# 25732691 18-Oct-2007 Alan Cox <alc@FreeBSD.org>

The previous revision, updating vm_object_page_remove() for the new page
cache, did not account for the case where the vm object has nothing but
cached pages.

Reported by: kris, tegge
Reviewed by: tegge
MFC after: 3 days


# c9444914 26-Sep-2007 Alan Cox <alc@FreeBSD.org>

Correct an error of omission in the reimplementation of the page
cache: vm_object_page_remove() should convert any cached pages that
fall with the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.

Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.

Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)


# f3a2ed4b 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Correct an error in the previous revision, specifically,
vm_object_madvise() should request that the reactivated, cached page
not be busied.

Reported by: Rink Springer
Approved by: re (kensmith)


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 2446e4f0 15-Jun-2007 Alan Cox <alc@FreeBSD.org>

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re


# 393a081d 10-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Optimize vmmeter locking.
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some unuseful comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 19c244d0 27-Mar-2007 Alan Cox <alc@FreeBSD.org>

Prevent a race between vm_object_collapse() and vm_object_split() from
causing a crash.

Suppose that we have two objects, obj and backing_obj, where
backing_obj is obj's backing object. Further, suppose that
backing_obj has a reference count of two. One being the reference
held by obj and the other by a map entry. Now, suppose that the map
entry is deallocated and its reference removed by
vm_object_deallocate(). vm_object_deallocate() recognizes that the
only remaining reference is from a shadow object, obj, and calls
vm_object_collapse() on obj. vm_object_collapse() executes

if (backing_object->ref_count == 1) {
/*
* If there is exactly one reference to the backing
* object, we can collapse it into the parent.
*/
vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT);

vm_object_backing_scan(OBSC_COLLAPSE_WAIT) executes

if (op & OBSC_COLLAPSE_WAIT) {
vm_object_set_flag(backing_object, OBJ_DEAD);
}

Finally, suppose that either vm_object_backing_scan() or
vm_object_collapse() sleeps releasing its locks. At this instant,
another thread executes vm_object_split(). It crashes in
vm_object_reference_locked() on the assertion that the object is not
dead. If, however, assertions are not enabled, it crashes much later,
after the object has been recycled, in vm_object_deallocate() because
the shadow count and shadow list are inconsistent.

Reviewed by: tegge
Reported by: jhb
MFC after: 1 week


# c5474b8f 22-Mar-2007 Alan Cox <alc@FreeBSD.org>

Change the order of lock reacquisition in vm_object_split() in order to
simplify the code slightly. Add a comment concerning lock ordering.


# 8db5fc58 27-Feb-2007 John Baldwin <jhb@FreeBSD.org>

Use pause() in vm_object_deallocate() to yield the CPU to the lock holder
rather than a tsleep() on &proc0. The only wakeup on &proc0 is intended
to awaken the swapper, not random threads blocked in
vm_object_deallocate().


# 9f5c801b 24-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to a OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage(). It is no longer used.


# 0cd31a0d 21-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the page's CLEANCHK flag from being a page queue mutex synchronized
flag to a vm object mutex synchronized flag.


# f67af5c9 17-Jan-2007 Xin LI <delphij@FreeBSD.org>

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


# 73000556 17-Dec-2006 Alan Cox <alc@FreeBSD.org>

Optimize vm_object_split(). Specifically, make the number of iterations
equal to the number of physical pages that are renamed to the new object
rather than the new object's virtual size.


# 95442adf 16-Dec-2006 Alan Cox <alc@FreeBSD.org>

Simplify the computation of the new object's size in vm_object_split().


# 2a53696f 22-Oct-2006 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup(). Reduce or eliminate its use accordingly.


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# b276ae6f 21-Aug-2006 Alan Cox <alc@FreeBSD.org>

Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and
eliminate their declarations from various source files.


# b146f9e5 12-Aug-2006 Alan Cox <alc@FreeBSD.org>

Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 91449ce9 03-Aug-2006 Alan Cox <alc@FreeBSD.org>

When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.


# 78985e42 01-Aug-2006 Alan Cox <alc@FreeBSD.org>

Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)


# 604c2bbc 22-Jul-2006 Alan Cox <alc@FreeBSD.org>

Export the number of object bypasses and collapses through sysctl.


# af51d7bf 21-Jul-2006 Alan Cox <alc@FreeBSD.org>

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


# 2e9f4a69 17-Jul-2006 Alan Cox <alc@FreeBSD.org>

Ensure that vm_object_deallocate() doesn't dereference a stale object
pointer: When vm_object_deallocate() sleeps because of a non-zero
paging in progress count on either object or object's shadow,
vm_object_deallocate() must ensure that object is still the shadow's
backing object when it reawakens. In fact, object may have been
deallocated while vm_object_deallocate() slept. If so, reacquiring
the lock on object can lead to a deadlock.

Submitted by: ups@
MFC after: 3 weeks


# 3b582b4e 02-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


# ca95b514 21-Feb-2006 John Baldwin <jhb@FreeBSD.org>

Lock the vm_object while checking its type to see if it is a vnode-backed
object that requires Giant in vm_object_deallocate(). This is somewhat
hairy in that if we can't obtain Giant directly, we have to drop the
object lock, then lock Giant, then relock the object lock and verify that
we still need Giant. If we don't (because the object changed to OBJT_DEAD
for example), then we drop Giant before continuing.

Reviewed by: alc
Tested by: kris


# c05e22d4 01-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Install a temporary bandaid in vm_object_reference() that will stop
mtx_assert()s from triggering until I find a real long-term solution.


# 997e1c25 27-Jan-2006 Alan Cox <alc@FreeBSD.org>

Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)


# df59a0fe 25-Jan-2006 Jeff Roberson <jeff@FreeBSD.org>

- Avoid calling vm_object_backing_scan() when collapsing an object when
the resident page count matches the object size. We know it fully backs
its parent in this case.

Reviewed by: acl, tegge
Sponsored by: Isilon Systems, Inc.


# 02dd8331 22-Jan-2006 Alan Cox <alc@FreeBSD.org>

Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.


# ef39c05b 31-Dec-2005 Alexander Leidinger <netchild@FreeBSD.org>

MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 8215781b 03-Dec-2005 Alan Cox <alc@FreeBSD.org>

Eliminate unneeded preallocation at initialization.

Reviewed by: tegge


# f6d89838 22-Oct-2005 Alan Cox <alc@FreeBSD.org>

Use of the ZERO_COPY_SOCKETS options can result in an unusual state that
vm_object_backing_scan() was not written to handle. Specifically, a wired
page within a backing object that is shadowed by a page within the shadow
object. Handle this state by removing the wired page from the backing
object. The wired page will be freed by socow_iodone().

Stop masking errors: If a page is being freed by vm_object_backing_scan(),
assert that it is no longer mapped rather than quietly destroying any
mappings.

Tested by: Harald Schmalzbauer


# 8dbca793 09-Aug-2005 Tor Egge <tegge@FreeBSD.org>

Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE
due to the vm object being locked.

When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine. If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.

Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation. Detect and handle relevant page queue changes.

This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.

Reviewed by: alc


# b8a0b997 04-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- We need to inhert the OBJ_NEEDGIANT flag from the original object in
vm_object_split().

Spotted by: alc


# ed4fe4f4 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.


# c6ec6a7c 29-Mar-2005 Alan Cox <alc@FreeBSD.org>

Eliminate (now) unnecessary acquisition and release of the global page
queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY
flag and busy field are synchronized by the containing object's lock.

Testing the page's hold_count and wire_count in vm_object_backing_scan()'s
OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held
or wired pages cannot be migrated to the shadow object.

Reviewed by: tegge


# ee39666a 16-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't lock the vnode interlock in vm_object_set_writeable_dirty() if
we've already set the object flags.

Reviewed by: alc


# 8e99783b 30-Jan-2005 Alan Cox <alc@FreeBSD.org>

Update the text of an assertion to reflect changes made in revision 1.148.
Submitted by: tegge

Eliminate an unnecessary, temporary increment of the backing object's
reference count in vm_object_qcollapse(). Reviewed by: tegge


# ae51ff11 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


# d936694f 15-Jan-2005 Alan Cox <alc@FreeBSD.org>

Consider three objects, O, BO, and BBO, where BO is O's backing object
and BBO is BO's backing object. Now, suppose that O and BO are being
collapsed. Furthermore, suppose that BO has been marked dead
(OBJ_DEAD) by vm_object_backing_scan() and that either
vm_object_backing_scan() has been forced to sleep due to encountering
a busy page or vm_object_collapse() has been forced to sleep due to
memory allocation in the swap pager. If vm_object_deallocate() is
then called on BBO and BO is BBO's only shadow object,
vm_object_deallocate() will collapse BO and BBO. In doing so, it adds
a necessary temporary reference to BO. If this collapse also sleeps
and the prior collapse resumes first, the temporary reference will
cause vm_object_collapse to panic with the message "backing_object %p
was somehow re-referenced during collapse!"

Resolve this race by changing vm_object_deallocate() such that it
doesn't collapse BO and BBO if BO is marked dead. Once O and BO are
collapsed, vm_object_collapse() will attempt to collapse O and BBO.
So, vm_object_deallocate() on BBO need do nothing.

Reported by: Peter Holm on 20050107
URL: http://www.holm.cc/stress/log/cons102.html

In collaboration with: tegge@
Candidate for RELENG_4 and RELENG_5
MFC after: 2 weeks


# 7c0745ee 14-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 5ba514bc 08-Jan-2005 Alan Cox <alc@FreeBSD.org>

Move the acquisition and release of the page queues lock outside of a loop
in vm_object_split() to avoid repeated acquisition and release.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 98fe9a0d 17-Dec-2004 Alan Cox <alc@FreeBSD.org>

Eliminate another unnecessary call to vm_page_busy(). (See revision 1.333
for a detailed explanation.)


# 90688d13 07-Dec-2004 Alan Cox <alc@FreeBSD.org>

With the removal of kern/uipc_jumbo.c and sys/jumbo.h,
vm_object_allocate_wait() is not used. Remove it.


# dad740e9 06-Nov-2004 Alan Cox <alc@FreeBSD.org>

Eliminate an unnecessary atomic operation. Articulate the rationale in
a comment.


# 19187819 05-Nov-2004 Alan Cox <alc@FreeBSD.org>

Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc()
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.

Reviewed by: tegge@


# b546ac54 04-Nov-2004 Alan Cox <alc@FreeBSD.org>

Eliminate another unnecessary call to vm_page_busy() that immediately
precedes a call to vm_page_rename(). (See the previous revision for a
detailed explanation.)


# d19ef814 03-Nov-2004 Alan Cox <alc@FreeBSD.org>

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


# 9b98b796 29-Aug-2004 Alan Cox <alc@FreeBSD.org>

Move the acquisition and release of the lock on the object at the head of
the shadow chain outside of the loop in vm_object_madvise(), reducing the
number of times that this lock is acquired and released.


# b23f72e9 01-Aug-2004 Brian Feldman <green@FreeBSD.org>

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 874f0135 30-Jul-2004 Doug Rabson <dfr@FreeBSD.org>

Fix handling of msync(2) for character special files.

Submitted by: nvidia


# 56e0670f 28-Jul-2004 Alan Cox <alc@FreeBSD.org>

Correct a very old error in both vm_object_madvise() (originating in
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset. Consequently, both may
operate on the wrong pages.

Quoting Matt, "This could be responsible for all of the sporatic madvise
oddness that has been reported over the years."

Reviewed by: Matt Dillon


# 9b45f815 25-Jul-2004 Alan Cox <alc@FreeBSD.org>

Remove spl calls.


# 57a21aba 25-Jul-2004 Alan Cox <alc@FreeBSD.org>

Make the code and comments for vm_object_coalesce() consistent.


# 5285558a 22-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 9174ca7b 28-Jun-2004 Tor Egge <tegge@FreeBSD.org>

Initialize result->backing_object_offset before linking result onto the list of
vm objects shadowing source in vm_object_shadow(). This closes a race where
vm_object_collapse() could be called with a partially uninitialized object
argument causing symptoms that looked like hardware problems, e.g. signal 6,
10, 11 or a /bin/sh busy-waiting for a nonexistant child process.


# c53f7ace 25-May-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

MFS: vm_map.c rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE
semantics but provide a sysctl knob for reverting to old ones.


# 05eb3785 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# a7d86121 07-Mar-2004 Alan Cox <alc@FreeBSD.org>

Implement a work around for the deadlock avoidance case in
vm_object_deallocate() so that it doesn't spin forever either.

Submitted by: bde


# 85b8d6b4 21-Feb-2004 Alan Cox <alc@FreeBSD.org>

Correct a long-standing race condition in vm_object_page_remove() that
could result in a dirty page being unintentionally freed.

Reviewed by: tegge
MFC after: 7 days


# 23b186d3 17-Jan-2004 Alan Cox <alc@FreeBSD.org>

Don't acquire Giant in vm_object_deallocate() unless the object is vnode-
backed.


# d0058957 02-Jan-2004 Alan Cox <alc@FreeBSD.org>

Revision 1.74 of vm_meter.c ("Avoid lock-order reversal") makes the release
and subsequent reacquisition of the same vm object lock in
vm_object_collapse() unnecessary.


# 4da9f125 30-Dec-2003 Alan Cox <alc@FreeBSD.org>

- Modify vm_object_split() to expect a locked vm object on entry and
return on a locked vm object on exit. Remove GIANT_REQUIRED.
- Eliminate some unnecessary local variables from vm_object_split().


# 950f8459 08-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


# 63f6cefc 02-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Increase the scope of two vm object locks in vm_object_split().


# b921a12b 02-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Introduce and use vm_object_reference_locked(). Unlike
vm_object_reference(), this function must not be used to reanimate dead
vm objects. This restriction simplifies locking.

Reviewed by: tegge


# 22ec553f 01-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Increase the scope of two vm object locks in vm_object_collapse().
- Remove the acquisition and release of Giant from vm_object_coalesce().


# c7c8dd7e 01-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Modify swap_pager_copy() and its callers such that the source and
destination objects are locked on entry and exit. Add comments to
the callers noting that the locks can be released by swap_pager_copy().
- Remove several instances of GIANT_REQUIRED.


# de33bedd 31-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Additional vm object locking in vm_object_split()
- New vm object locking assertions in vm_page_insert() and
vm_object_set_writeable_dirty()


# 3b9a4cb6 31-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Revert a part of revision 1.73: Make vm_object_set_flag() an inline
function. This function is so trivial that inlining reduces the size
of the kernel.


# dc6279b8 31-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Take advantage of the swap pager locking: Eliminate the use of Giant
from vm_object_madvise().
- Remove excessive blank lines from vm_object_madvise().


# 43186e53 26-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Simplify vm_object_collapse()'s collapse case, reducing the number
of lock acquires and releases performed.
- Move an assertion from vm_object_collapse() to vm_object_zdtor()
because it applies to all cases of object destruction.


# 7a935082 18-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Increase the object lock's scope in vm_contig_launder() so that access
to the object's type field and the call to vm_pageout_flush() are
synchronized.
- The above change allows for the eliminaton of the last parameter
to vm_pageout_flush().
- Synchronize access to the page's valid field in vm_pageout_flush()
using the containing object's lock.


# f3c625e4 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Use the UMA_ZONE_VM flag on the fakepg and object zones to prevent
vm recursion and LORs. This may be necessary for other zones created in
the vm but this needs to be verified.


# 1dabe306 17-Sep-2003 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from vm_object_shadow().


# 82f9defe 14-Sep-2003 Alan Cox <alc@FreeBSD.org>

Eliminate the use of Giant from vm_object_reference().


# b881da26 13-Sep-2003 Alan Cox <alc@FreeBSD.org>

There is no need for an atomic increment on the vm object's generation
count in _vm_object_allocate(). (Access to the generation count is
governed by the vm object's lock.) Note: the introduction of the
atomic increment in revision 1.238 appears to be an accident. The
purpose of that commit was to fix an Alpha-specific bug in UMA's
debugging code.


# 07f81f91 05-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove an unused variable.


# 9c65e7a3 26-Jul-2003 Alan Cox <alc@FreeBSD.org>

Allow vm_object_reference() on kernel_object without Giant.


# b4ae4780 22-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Don't inline very large functions.

Gcc has silently not been doing this for a long time.


# 7ca33ad1 21-Jun-2003 Alan Cox <alc@FreeBSD.org>

Complete the vm object locking in vm_object_backing_scan(); specifically,
deal with the case where we need to sleep on a busy page with two vm object
locks held.


# 06ecade7 20-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Increase the scope of the vm object lock in vm_object_collapse().
- Assert that the vm object and its backing vm object are both locked in
vm_object_qcollapse().


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 3471677c 09-Jun-2003 Alan Cox <alc@FreeBSD.org>

Don't use vm_object_set_flag() to initialize the vm object's flags.


# 138449dc 08-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Properly handle the paging_in_progress case on two vm objects in
vm_object_deallocate().
- Remove vm_object_pip_sleep().


# d7fc2210 06-Jun-2003 Alan Cox <alc@FreeBSD.org>

Pass the vm object to vm_object_collapse() with its lock held.


# 40b808a8 05-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Extend the scope of the backing object's lock in vm_object_collapse().


# b72b0115 04-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Add further vm object locking to vm_object_deallocate(), specifically,
for accessing a vm object's shadows.


# 3b68228c 04-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Add vm object locking to vm_object_deallocate(). (Still more
changes are required.)
- Remove special-case macros for kmem object locking. They are
no longer used.


# bdbfbaaf 03-Jun-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to vm_object_coalesce().


# cccf11b8 01-Jun-2003 Alan Cox <alc@FreeBSD.org>

Change kernel_object and kmem_object to (&kernel_object_store) and
(&kmem_object_store), respectively. This allows the address of these
objects to be resolved at link-time rather than run-time.


# 34567de7 31-May-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to vm_object_madvise().


# 1c500307 17-May-2003 Alan Cox <alc@FreeBSD.org>

Reduce the size of a vm object by converting its shadow list from a TAILQ
to a LIST.

Approved by: re (rwatson)


# 3a12f5da 08-May-2003 Alan Cox <alc@FreeBSD.org>

Give the kmem object's mutex a unique name, instead of "vm object",
to avoid false reports of lock-order reversal with a system map mutex.

Approved by: re (jhb)


# 658ad5ff 05-May-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object when performing vm_pager_deallocate().


# f7dd7b63 04-May-2003 Alan Cox <alc@FreeBSD.org>

Extend the scope of the vm_object lock in vm_object_terminate().


# ad682c48 03-May-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object on entry to vm_object_vndeallocate().


# bff99f0d 03-May-2003 Alan Cox <alc@FreeBSD.org>

- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse


# f92039a1 02-May-2003 Alan Cox <alc@FreeBSD.org>

Move a declaration to its proper place.


# 6be36525 01-May-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object when updating its shadow list.


# 4f7c7f6e 01-May-2003 Alan Cox <alc@FreeBSD.org>

Simplify the removal of a shadow object in vm_object_collapse().


# 8e3a76fb 30-Apr-2003 Alan Cox <alc@FreeBSD.org>

Extend the scope of the vm_object locking in vm_object_split().


# 15347817 30-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Update the vm_object locking in vm_object_reference().
- Convert some dead code in vm_object_reference() into a comment.


# ed6a7863 27-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Define VM_OBJECT_LOCK_INIT().
- Avoid repeatedly mtx_init()ing and mtx_destroy()ing the vm_object's lock
using UMA's uminit callback, in this case, vm_object_zinit().


# c9917419 27-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Tell witness that holding two or more vm_object locks is okay.
- In vm_object_deallocate(), lock the child when removing the parent
from the child's shadow list.


# 570a2f4a 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

Various changes to vm_object_shadow(): (1) update the vm_object locking,
(2) remove a pointless assertion, and (3) make a trivial change to a
comment.


# ecde4b32 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.


# c829b9d0 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object on entry to vm_object_terminate().


# 1ca58953 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# 155080d3 25-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Extend the scope of two existing vm_object locks to cover
swap_pager_freespace().


# b6e48e03 23-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


# d647a0ed 21-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Assert that the vm_object is locked in vm_object_clear_flag(),
vm_object_pip_add() and vm_object_pip_wakeup().
- Remove GIANT_REQUIRED from vm_object_pip_subtract() and
vm_object_pip_subtract().
- Lock the vm_object when performing vm_object_page_remove().


# d7a013c3 20-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing either vm_object_clear_flag() or
vm_object_pip_wakeup().


# d22bc710 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_add().


# 0fa05eae 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().


# 0d420ad3 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().


# d1dc776d 13-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock some manipulations of the vm object's flags.


# e2479b4f 13-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock some manipulations of the vm object's flags.


# f279b88d 12-Apr-2003 Alan Cox <alc@FreeBSD.org>

Permit vm_object_pip_add() and vm_object_pip_wakeup() on the kmem_object
without Giant held.


# 227f9a1c 24-Mar-2003 Jake Burkholder <jake@FreeBSD.org>

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 09c80124 05-Mar-2003 Alan Cox <alc@FreeBSD.org>

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 75741c04 26-Jan-2003 Alan Cox <alc@FreeBSD.org>

Simplify vm_object_page_remove(): The object's memq is now ordered. The
two cases that existed before for performance optimization purposes can
be reduced to one.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 4dbeceee 04-Jan-2003 Alan Cox <alc@FreeBSD.org>

Use vm_object_lock() and vm_object_unlock() in vm_object_deallocate().
(This procedure needs further work, but this change is sufficient for
locking the kmem_object.)


# 5440b5a9 03-Jan-2003 Alan Cox <alc@FreeBSD.org>

Refine the assertion in vm_object_clear_flag() to allow operation on the
kmem_object without Giant. In that case, assert that the kmem_object's
mutex is held.


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# e3a9e1b2 29-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Remove vm_object_init2(). It is unused.
- Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero()
to migrate from _vm_object_allocate() to vm_object_zinit(), where it
will be performed less often.)


# a28cc55e 29-Dec-2002 Alan Cox <alc@FreeBSD.org>

Reduce the number of times that we acquire and release the page queues
lock by making vm_page_rename()'s caller, rather than vm_page_rename(),
responsible for acquiring it.


# 43b7990e 28-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Allow the VM object flushing code to cluster. When the filesystem syncer
comes along and flushes a file which has been mmap()'d SHARED/RW, with
dirty pages, it was flushing the underlying VM object asynchronously,
resulting in thousands of 8K writes. With this change the VM Object flushing
code will cluster dirty pages in 64K blocks.

Note that until the low memory deadlock issue is reviewed, it is not safe
to allow the pageout daemon to use this feature. Forced pageouts still
use fs block size'd ops for the moment.

MFC after: 3 days


# 35c01631 27-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Change vm_object_page_collect_flush() to assert rather than
acquire the page queues lock.
- Acquire the page queues lock in vm_object_page_clean().


# dc907f66 23-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Hold the page queues lock around vm_page_wakeup().


# 4b420d50 19-Dec-2002 Alan Cox <alc@FreeBSD.org>

Add a mutex to struct vm_object. Initialize and destroy that mutex
at appropriate times. For the moment, the mutex is only used on
the kmem_object.


# cf3e6e48 19-Dec-2002 Alan Cox <alc@FreeBSD.org>

Remove the hash_rand field from struct vm_object. As of revision 1.215 of
vm/vm_page.c, it is unused.


# bd82dc74 17-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Hold the page queues lock when performing vm_page_busy().
- Replace vm_page_sleep_busy() with proper page queues locking
and vm_page_sleep_if_busy().


# 2840cabe 15-Dec-2002 Alan Cox <alc@FreeBSD.org>

As per the comments, vm_object_page_remove() now expects its caller to lock
the object (i.e., acquire Giant).


# 3a199de3 27-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock while performing pmap_page_protect().

Approved by: re (blanket)


# 13dc71ed 23-Nov-2002 Alan Cox <alc@FreeBSD.org>

Extend the scope of the page queues/fields locking in vm_freeze_copyopts()
to cover pmap_remove_all().

Approved by: re


# a12cc0e4 17-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# 81b9ee99 13-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove dead code that hasn't been needed since the demise of share maps
in various revisions of vm/vm_map.c between 1.148 and 1.153.


# 81f71eda 11-Nov-2002 Matt Jacob <mjacob@FreeBSD.org>

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# e47cd172 07-Nov-2002 Maxime Henrion <mux@FreeBSD.org>

Some more printf() format fixes.


# b86ec922 18-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re


# 3ef3e7c4 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the unused LK_NOOBJ.


# 15c176c1 24-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Use vm_object_lock() in place of directly locking Giant.

Reviewed by: md5


# 99cb3c4c 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_activate().


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 00f9e8b4 02-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Convert two instances of vm_page_sleep_busy() into vm_page_sleep_if_busy()
with appropriate page queue locking.


# 91bb74a8 01-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_deactivate().


# 32585dd6 30-Jul-2002 Alan Cox <alc@FreeBSD.org>

o In vm_object_madvise() and vm_object_page_remove() replace
vm_page_sleep_busy() with vm_page_sleep_if_busy(). At the same time,
increase the scope of the page queues lock. (This should significantly
reduce the locking overhead in vm_object_page_remove().)
o Apply some style fixes.


# 6a684ecf 28-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_free().


# 55df3298 27-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Require that the page queues lock is held on entry to vm_pageout_clean()
and vm_pageout_flush().
o Acquire the page queues lock before calling vm_pageout_clean()
or vm_pageout_flush().


# f4f5cb1f 25-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Remove a vm_page_deactivate() that is immediately followed by a
vm_page_rename() from vm_object_backing_scan(). vm_page_rename()
also performs vm_page_deactivate() on pages in the cache queues,
making the removed vm_page_deactivate() redundant.


# 2999e9fa 22-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_dontneed().
o Assert that the page queue lock is held in vm_page_dontneed().


# 56030358 09-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock accesses to the page queues in vm_object_terminate().
o Eliminate some unnecessary 64-bit arithmetic in vm_object_split().


# c7118ed6 07-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Traverse the object's memq rather than repeatedly calling vm_page_lookup()
in vm_object_split().


# 300b96ac 29-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Change the type of `tscan' in vm_object_page_clean() to vm_pindex_t,
as it stores an absolute page index that may not fit in a vm_offset_t.


# 23f09d50 26-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Avoid using the 64-bit vm_pindex_t in a few places where 64-bit
types are not required, as the overhead is unnecessary:

o In the i386 pmap_protect(), `sindex' and `eindex' represent page
indices within the 32-bit virtual address space.
o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary
variable to store the low few bits of a vm_pindex_t that gets used
as an array index.
o vm_uiomove() uses `osize' and `idx' for page offsets within a
map entry.
o In vm_object_split(), `idx' is a page offset within a map entry.


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 6395da54 25-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Complete the initial set of VM changes required to support full
64-bit file sizes. This step simply addresses the remaining overflows,
and does attempt to optimise performance. The details are:

o Use a 64-bit type for the vm_object `size' and the size argument
to vm_object_allocate().
o Use the correct type for index variables in dev_pager_getpages(),
vm_object_page_clean() and vm_object_page_remove().
o Avoid an overflow in the i386 pmap_object_init_pt().


# 00e1854a 19-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Replace GIANT_REQUIRED in vm_object_coalesce() by the acquisition and
release of Giant.
o Reduce the scope of GIANT_REQUIRED in vm_map_insert().

These changes will enable us to remove the acquisition and release
of Giant from obreak().


# c5aaa06d 02-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Migrate vm_map_split() from vm_map.c to vm_object.c, renaming it
to vm_object_split(). Its interface should still be changed
to resemble vm_object_shadow().


# 72353893 02-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Condition vm_object_pmap_copy_1()'s compilation on the kernel
option ENABLE_VFS_IOOPT. Unless this option is in effect,
vm_object_pmap_copy_1() is not used.


# 9917e010 30-May-2002 Alan Cox <alc@FreeBSD.org>

Further work on pushing Giant out of the vm_map layer and down
into the vm_object layer:
o Acquire and release Giant in vm_object_shadow() and
vm_object_page_remove().
o Remove the GIANT_REQUIRED assertion preceding vm_map_delete()'s call
to vm_object_page_remove().
o Remove the acquisition and release of Giant around vm_map_lookup()'s
call to vm_object_shadow().


# 094f6d26 18-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and
release Giant around vm_map_madvise()'s call to pmap_object_init_pt().
o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition
and release of Giant.
o Remove the acquisition and release of Giant from madvise().


# 47c3ccc4 11-May-2002 Alan Cox <alc@FreeBSD.org>

o Acquire and release Giant in vm_object_reference() and
vm_object_deallocate(), replacing the assertion GIANT_REQUIRED.
o Remove GIANT_REQUIRED from vm_map_protect() and vm_map_simplify_entry().
o Acquire and release Giant around vm_map_protect()'s call to pmap_protect().

Altogether, these changes eliminate the need for mprotect() to acquire
and release Giant.


# c0b6bbb8 05-May-2002 Alan Cox <alc@FreeBSD.org>

o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.


# dcc5840e 05-May-2002 Alan Cox <alc@FreeBSD.org>

o Some improvements to the page coloring of vm objects, particularly,
for shadow objects.

Submitted by: bde


# e86256c1 05-May-2002 Alan Cox <alc@FreeBSD.org>

o Move vm_freeze_copyopts() from vm_map.{c.h} to vm_object.{c,h}. It's plainly
an operation on a vm_object and belongs in the latter place.


# 79660d83 04-May-2002 Alan Cox <alc@FreeBSD.org>

o Make _vm_object_allocate() and vm_object_allocate() callable
without holding Giant.
o Begin documenting the trivial cases of the locking protocol
on vm_object.


# a5698387 20-Apr-2002 Alan Cox <alc@FreeBSD.org>

Reintroduce locking on accesses to vm_object_list.


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 670d17b5 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 9f0567f5 17-Mar-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_object_count: It's unused, incorrectly maintained and duplicates
information maintained by the zone allocator.


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# b9b7a4be 05-Mar-2002 Matthew Dillon <dillon@FreeBSD.org>

Add a sequential iteration optimization to vm_object_page_clean(). This
moderately improves msync's and VM object flushing for objects containing
randomly dirtied pages (fsync(), msync(), filesystem update daemon),
and improves cpu use for small-ranged sequential msync()s in the face of
very large mmap()ings from O(N) to O(1) as might be performed by a database.

A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two
committed optimizations are turned on by default). 0 will turn off both
optimizations.

This code has already been tested under stable and is one in a series of
memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart
conditions that will be coming down the pipe.

MFC after: 3 days


# 7a5a6352 26-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Move recently added procedure which was incorrectly placed within an
#ifdef DDB block.


# 245df27c 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# b06805ad 30-Jul-2001 Jake Burkholder <jake@FreeBSD.org>

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


# 1b40f8c0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 08442f8a 22-Jun-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Introduce numerous SMP friendly changes to the mbuf allocator. Namely,
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:

o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution due to excessively large allocation macros.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.

Additional things changed with this addition:

- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not depracated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All infos are fetched via sysctl.

TODO (in order of priority):

- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters at least for mbuf clusters into an unused portion
of the cluster itself, to save space and need to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.

Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/


# 60517fd1 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Assert that the vm lock is held for all of _vm_object_allocate().
- Restore the previous order of setting up a new vm_object. The previous
had a small bug where we zero'd out the flags after we set the
OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
places.
- Add in some #ifdef objlocks code to lock individual vm objects when
vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command. If DDB
blocked on the lock, that would be worse than having an inconsistent
allproc list.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# cc64b484 15-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

use TAILQ_FOREACH, fix a comment's location


# 971dd342 13-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

if/panic -> KASSERT


# 1005a129 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# 8125b1e6 04-Mar-2001 Alfred Perlstein <alfred@FreeBSD.org>

Simplify vm_object_deallocate(), by decrementing the refcount first.
This allows some of the conditionals to be combined.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 1b367556 23-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 21cd6e62 13-Dec-2000 Seigo Tanimura <tanimura@FreeBSD.org>

- If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.

This is just a temporary fix. Better solutions include:
(suggested by: dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by: alfred, dillon


# c0c25570 12-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 553629eb 22-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 8b03c8ed 29-May-2000 Matthew Dillon <dillon@FreeBSD.org>

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# d7414c44 19-Apr-2000 Alan Cox <alc@FreeBSD.org>

vm_object_shadow: Remove an incorrect assertion. In obscure circumstances
vm_object_shadow can be called on an object with ref_count > 1 and
OBJ_ONEMAPPING set. This isn't really a problem for vm_object_shadow.


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# db5f635a 16-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 4f79d873 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 02577fa2 28-Oct-1999 Alan Cox <alc@FreeBSD.org>

Remove the last vestiges of "vm_map_t phys_map". It's been unused
since i386/i386/machdep.c rev 1.45 (or 1994 :-) ).


# 479112df 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove inappropriate VOP_FSYNC from vm_object_page_clean(). The fsync
syncs the entire underlying file rather then just the requested range,
resulting in huge inefficiencies when the VM system is articulated in
a certain way. The VOP_FSYNC was also found to massively reduce NFS
performance in certain cases.

Change MADV_DONTNEED and MADV_FREE to call vm_page_dontneed() instead
of vm_page_deactivate(). Using vm_page_deactivate() causes all
inactive and cache pages to be recycled before the dontneed/free page
is recycled, effectively flushing our entire VM inactive & cache
queues continuously even if only a few pages are being actively MADV
free'd and reused (such as occurs with a sequential scan of a
memory-mapped file).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 76782487 15-Aug-1999 Alan Cox <alc@FreeBSD.org>

Remove the declarations for "vm_map_t io_map". It's been unused
since i386/i386/machdep rev 1.310, i.e., the demise of BOUNCE_BUFFERS.


# aecb0ebb 15-Aug-1999 Alan Cox <alc@FreeBSD.org>

Remove the declarations for "vm_map_t u_map". It's been unused
since i386/i386/pmap rev 1.190. (The alpha never used it.)


# 193b9358 12-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_object_madvise:
Update the comments to match the implementation.

Submitted by: dillon


# 58b4e6cc 12-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_object_madvise:
Support MADV_DONTNEED and MADV_WILLNEED on object types
besides OBJT_DEFAULT and OBJT_SWAP.

Submitted by: dillon


# ce9edcf5 09-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Merge the cons.c and cons.h to the best of my ability. alpha may or
may not compile, I can't test it.


# 7f866e4b 01-Aug-1999 Alan Cox <alc@FreeBSD.org>

Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by: dillon


# 9b21395a 15-Jul-1999 Alan Cox <alc@FreeBSD.org>

Remove vm_object::last_read. It is used by the old swap pager, but
not by the new one, i.e., vm/swap_pager.c rev 1.108.

Reviewed by: dillon@backplane.com


# 32b76dfa 11-Jul-1999 Alan Cox <alc@FreeBSD.org>

Cleanup OBJ_ONEMAPPING management.

vm_map.c:
Don't set OBJ_ONEMAPPING on arbitrary vm objects. Only default
and swap type vm objects should have it set. vm_object_deallocate
already handles these cases.

vm_object.c:
If OBJ_ONEMAPPING isn't already clear in vm_object_shadow,
we are in trouble. Instead of clearing it, make it
an assertion that it is already clear.


# 3efc015b 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Fix some int/long printf problems for the Alpha


# 60ff97b0 20-Jun-1999 Alan Cox <alc@FreeBSD.org>

Remove vm_object::cache_count and vm_object::wired_count. They are
not used. (Nor is there any planned use by John who introduced them.)

Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>


# c7997d57 29-May-1999 Alan Cox <alc@FreeBSD.org>

Addendum to 1.155. Verify the existence of the object before checking
its reference count.


# 9a2f6362 27-May-1999 Alan Cox <alc@FreeBSD.org>

Avoid the creation of unnecessary shadow objects.


# ea41812f 15-May-1999 Alan Cox <alc@FreeBSD.org>

Remove prototypes for functions that don't exist anymore (vm_map.h).

Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).

Remove a redundant test in vm_uiomove (vm_map.c).

Make two changes to vm_object_coalesce:

1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)

2. Free any swap space allocated to removed pages.


# a1a54e9f 13-Mar-1999 Alan Cox <alc@FreeBSD.org>

Correct two optimization errors in vm_object_page_remove:

1. The size of vm_object::memq is vm_object::resident_page_count,
not vm_object::size.

2. The "size > 4" test sometimes results in the traversal of a ~1000 page
memq in order to locate ~10 pages.


# d1bf5d56 24-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove unnecessary page protects on map_split and collapse operations.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).

Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson


# 1ce137be 14-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix a bug in the new madvise() code that would possibly (improperly)
free swap space out from under a busy page. This is not legal because
the swap may be reallocated and I/O issued while I/O is still in
progress on the same swap page from the madvise()'d object. This bug
could only occur under extreme paging conditions but might not cause
an error until much later. As a side-benefit, madvise() is now even
smaller.


# 41c67e12 12-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Minor optimization to madvise() MADV_FREE to make page as freeable as
possible without actually unmapping it from the process.

As of now, I declare madvise() on OBJT_DEFAULT/OBJT_SWAP objects to be
'working and complete'.


# 2aaeadf8 12-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix non-fatal bug in vm_map_insert() which improperly cleared
OBJ_ONEMAPPING in the case where an object is extended by an
additional vm_map_entry must be allocated.

In vm_object_madvise(), remove calll to vm_page_cache() in MADV_FREE
case in order to avoid a page fault on page reuse. However, we still
mark the page as clean and destroy any swap backing store.

Submitted by: Alan Cox <alc@cs.rice.edu>


# 2ad1a3f7 08-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Revamp vm_object_[q]collapse(). Despite the complexity of this patch,
no major operational changes were made. The three core object->memq loops
were moved into a single inline procedure and various operational
characteristics of the collapse function were documented.


# d031cff1 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

General cleanup. Remove #if 0's and remove useless register qualifiers.


# 9fdfe602 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 9b09fe24 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

When shadowing objects, adjust the page coloring of the shadowing object
such that pages in the combined/shadowed object are consistantly
colored.

Submitted by: "John S. Dyson" <dyson@iquest.net>


# 588059be 04-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix bug in a KASSERT I introduced in vm_page_qcollapse() rev 1.139.

Since paging is in progress, page scan in vm_page_qcollapse() must be
protected at atleast splbio() to prevent pages from being ripped out from
under the scan.


# 4112823f 02-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Submitted by: Alan Cox

The vm_map_insert()/vm_object_coalesce() optimization has been extended
to include OBJT_SWAP objects as well as OBJT_DEFAULT objects. This is
possible because it costs nothing to extend an OBJT_SWAP object with
the new swapper. We can't do this with the old swapper. The old swapper
used a linear array that would have had to have been reallocated, costing
time as well as a potential low-memory deadlock.


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 8e3ad7c9 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Depreciate vm_object_pmap_copy() - nobody uses it. Everyone uses
vm_object_pmap_copt_1() now, apparently.


# 7bc9e80e 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

object->id was badly implemented. It has simply been removed.

object->paging_offset has been removed - it was used to optimize a
single OBJT_SWAP collapse case yet introduced massive confusion throughout
vm_object.c. The optimization was inconsequential except for the
claim that it didn't have to allocate any memory. The optimization
has been removed.

madvise() has been fixed. The old madvise() could be made to operate
on shared objects which is a big no-no. The new one is much more careful
in what it modifies. MADV_FREE was totally broken and has now been fixed.

vm_page_rename() now automatically dirties a page, so explicit dirtying
of the page prior to calling vm_page_rename() has been removed.


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 289bdf33 02-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Ifdefed conditionally used simplock variables.


# dd0b2081 05-Nov-1998 David Greenman <dg@FreeBSD.org>

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.


# e4b7635d 27-Oct-1998 David Greenman <dg@FreeBSD.org>

Added needed splvm() protection around object page traversal in
vm_object_terminate().


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 9fcfb650 22-Oct-1998 David Greenman <dg@FreeBSD.org>

Oops, revert part of last fix. vm_pager_dealloc() can't be called until
after the pages are removed from the object...so fix the problem by
not printing the diagnostic for wired fictitious pages (which is normal).


# 356863eb 22-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two bugs in recent commit: in vm_object_terminate, vm_pager_dealloc
needs to be called prior to freeing remaining pages in the object so that
the device pager has an opportunity to grab its "fake" pages. Also, in
the case of wired pages, the page must be made busy prior to calling
vm_page_remove. This is a difference from 2.2.x that I overlooked when
I brought these changes forward.


# 0b10ba98 21-Oct-1998 David Greenman <dg@FreeBSD.org>

Make the VM system handle the case where a terminating object contains
legitimately wired pages. Currently we print a diagnostic when this
happens, but this will be removed soon when it will be common for this
to occur with zero-copy TCP/IP buffers.


# ce65e68c 27-Sep-1998 David Greenman <dg@FreeBSD.org>

Be more selctive about when we clear p->valid.
Submitted by: John Dyson <toor@dyson.iquest.net>


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# d474eaaa 06-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)


# eb95adef 13-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Print pointers using %p instead of attempting to print them by
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).


# fc62ef1f 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# e5b19842 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# cf2819cc 21-May-1998 John Dyson <dyson@FreeBSD.org>

Make flushing dirty pages work correctly on filesystems that
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)


# c0877f10 28-Apr-1998 John Dyson <dyson@FreeBSD.org>

Tighten up management of memory and swap space during map allocation,
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 6215e862 08-Mar-1998 John Dyson <dyson@FreeBSD.org>

Remove a very ill advised vm_page_protect. This was being called
for a non-managed page. That is a big no-no.


# edd97f3a 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

Several minor fixes:
1) When freeing pages, it is a good idea to protect them off.
(This is probably gratuitious, but good form.)
2) Allow collapsing pages in the backing object that are
PQ_CACHE. This will improve memory utilization.
3) Correct the collapse code so that pages that were on the
cache queue are moved to the inactive queue. This is
done when pages are marked dirty (so that those pages
will be properly paged out instead of freed), so that
cached pages will not be paradoxically marked dirty.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# 66095752 24-Feb-1998 John Dyson <dyson@FreeBSD.org>

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# bf27292b 06-Jan-1998 John Dyson <dyson@FreeBSD.org>

Turn off the VTEXT flag when an object is no longer referenced, so
that an executable that is no longer running can be written to. Also,
clear the OBJ_OPT flag more often, when appropriate.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 1efb74fb 19-Dec-1997 John Dyson <dyson@FreeBSD.org>

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# fe0dd4ac 18-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #include of <sys/malloc.h>. This file now uses only
zalloc(). Many more cases like this are probably obscured by not
including <vm/zone.h> explicitly (it is spammed into <sys/malloc.h>).


# 0abc78a6 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


# 0a80f406 24-Oct-1997 John Dyson <dyson@FreeBSD.org>

Decrease the initial allocation for the zone allocations.


# 99448ed1 20-Sep-1997 John Dyson <dyson@FreeBSD.org>

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 4de628de 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Some staticized variables were still declared to be extern.


# 3075778b 04-Aug-1997 John Dyson <dyson@FreeBSD.org>

Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of
a simple, clean zone type allocator. This new allocator will also be
used for machine dependent pmap PV entries.


# 3b18caba 22-Jun-1997 Peter Wemm <peter@FreeBSD.org>

Kill some stale leftovers from the earlier attempts at SMP per-cpu pages


# 3c631446 21-Jun-1997 John Dyson <dyson@FreeBSD.org>

Remove a window during running down a file vnode. Also, the OBJ_DEAD
flag wasn't being respected during vref(), et. al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.


# 0228905a 28-May-1997 Peter Wemm <peter@FreeBSD.org>

Update the #include "opt_smpxxx.h" includes - opt_smp.h isn't needed
very much in the generic parts of the kernel now.


# 477a642c 26-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 6e20a165 19-Jan-1997 John Dyson <dyson@FreeBSD.org>

Make MADV_FREE work better. Specifically, it did not wait for
the page to be unbusy, and it caused some algorithmic problems
as a result. There were some other problems with it also, so
this is a general cleanup of the code.
Submitted by: Douglas Crosher <dtc@scrooge.ee.swin.oz.au> and myself.


# afa07f7e 15-Jan-1997 John Dyson <dyson@FreeBSD.org>

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 106031ef 03-Jan-1997 John Dyson <dyson@FreeBSD.org>

Undo the collapse breakage (swap space usage problem.)


# 3c018e72 31-Dec-1996 John Dyson <dyson@FreeBSD.org>

Guess what? We left alot of the old collapse code that is not needed
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.


# 8cc7e047 31-Dec-1996 John Dyson <dyson@FreeBSD.org>

A very significant improvement in the management of process maps
and objects. Previously, "fancy" memory management techniques
such as that used by the M3 RTS would have the tendancy of chopping
up processes allocated memory into lots of little objects. Alan
has come up with some improvements to migtigate the sitution to
the point where even the M3 RTS only has one object for bss and
it's managed memory (when running CVSUP.) (There are still cases where the
situation isn't improved when the system pages -- but this is much much
better for the vast majority of cases.) The system will now be able
to much more effectively merge map entries.

Submitted by: Alan Cox <alc@cs.rice.edu>


# a2f4a846 27-Sep-1996 John Dyson <dyson@FreeBSD.org>

Reviewed by:
Submitted by:
Obtained from:


# c7c34a24 14-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Attached vm ddb commands `show map', `show vmochk', `show object',
`show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff
to the ends of the vm source files.

Changed printf() to db_printf(), `indent' to db_indent, and iprintf()
to db_iprintf() in ddb commands. Moved db_indent and db_iprintf()
from vm to ddb.

vm_page.c:
Don't use __pure. Staticized.

db_output.c:
Reduced page width from 80 to 79 to inhibit double spacing for long
lines (there are still some problems if words are printed across
column 79).


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# b5b40fa6 16-Jun-1996 John Dyson <dyson@FreeBSD.org>

Various bugfixes/cleanups from me and others:
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.


# f35329ac 30-May-1996 John Dyson <dyson@FreeBSD.org>

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# 3077a9c2 23-May-1996 John Dyson <dyson@FreeBSD.org>

Eliminate inefficient check for dirty pages for pages in the PQ_CACHE
queue. Also, modify the MADV_FREE policy (it probably still isn't the final
version.)


# 0a47b48b 22-May-1996 John Dyson <dyson@FreeBSD.org>

Initial support for MADV_FREE, support for pages that we don't care
about the contents anymore. This gives us alot of the advantage of
freeing individual pages through munmap, but with almost none of the
overhead.


# 4a62209c 21-May-1996 John Dyson <dyson@FreeBSD.org>

After reviewing the previous commit to vm_object, the page protection
is never necessary, not just for PG_FICTICIOUS.


# 07c647c5 20-May-1996 John Dyson <dyson@FreeBSD.org>

Don't protect non-managed pages off during object rundown. This fixes
a hang that occurs under certain circumstances when exiting X.


# 867a482d 19-May-1996 John Dyson <dyson@FreeBSD.org>

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 0891ef4c 23-Apr-1996 John Dyson <dyson@FreeBSD.org>

This fixes kmem_malloc/kmem_free (and malloc/free of objects of > 8K).
A page index was calculated incorrectly in vm_kern, and vm_object_page_remove
removed pages that should not have been.


# 46268a60 28-Mar-1996 David Greenman <dg@FreeBSD.org>

Revert to previous calculation of vm_object_cache_max: it simply works
better in most real-world cases.


# 30dcfc09 27-Mar-1996 John Dyson <dyson@FreeBSD.org>

VM performance improvements, and reorder some operations in VM fault
in anticipation of a fix in pmap that will allow the mlock system call to work
without panicing the system.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 1b67ec6d 10-Mar-1996 Jeffrey Hsu <hsu@FreeBSD.org>

For Lite2: proc LIST changes.
Reviewed by: davidg & bde


# de5f6a77 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 0e41ee30 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert DDB to new-style option.


# a2d5b142 04-Jan-1996 David Greenman <dg@FreeBSD.org>

Increased vm_object_cache_max by about 50% to yield better utilization of
memory when lots of small files are cached.

Reviewed by: dyson


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# cac597e4 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# aef922f5 05-Nov-1995 John Dyson <dyson@FreeBSD.org>

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


# e17bed12 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

First phase of removing the PG_COPYONWRITE flag, and an architectural
cleanup of mapping files.


# 187f0238 26-Aug-1995 Bruce Evans <bde@FreeBSD.org>

Change vm_object_print() to have the correct number and type of args
for a ddb command.


# bf25be48 16-Aug-1995 Bruce Evans <bde@FreeBSD.org>

Make everything except the unsupported network sources compile cleanly
with -Wnested-externs.


# 28f8db14 29-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


# 2a4895f4 16-Jul-1995 David Greenman <dg@FreeBSD.org>

1) Merged swpager structure into vm_object.
2) Changed swap_pager internal interfaces to cope w/#1.
3) Eliminated object->copy as we no longer have copy objects.
4) Minor stylistic changes.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# d3628763 11-Jun-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Merge RELENG_2_0_5 into HEAD


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 61f5d510 21-May-1995 David Greenman <dg@FreeBSD.org>

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# f325917a 01-May-1995 David Greenman <dg@FreeBSD.org>

Changed object hash list to be a list rather than a tailq. This saves
space for the hash list buckets and is a little faster. The features
of tailq aren't needed. Increased the size of the object hash table
to improve performance. In the future, this will be changed so that
the table is sized dynamically.


# 7e15fd27 20-Apr-1995 John Dyson <dyson@FreeBSD.org>

Fixed a problem in _vm_object_page_clean that could cause an
infinite loop.


# c3cb3e12 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# ec4f9fb0 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Fixed a few bugs in vm_object_page_clean, mostly related to not syncing
pages that are in FS buffers. This fixes the (believed to already have been
fixed) problem with msync() not doing it's job...in other words, the
stuff that Andrew has continuously been complaining about.

Submitted by: John Dyson, w/minor changes by me.


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 260295f9 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed (almost) meaningless "object cache lookups/hits" statistic. In
our framework, these numbers will usually be nearly the same, and not
because of any sort of high 'hit rate'.


# a3429799 24-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed cnt.v_nzfod: In our current scheme of things it is not possible
to accurately track this. It isn't an indicator of resource consumption
anyway.
Removed cnt.v_kernel_pages: We don't implement this and doing so accurately
would be very difficult (and ambiguous - since process pages are often
double mapped in the kernel and the process address spaces).


# d7a0fc93 22-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed warning caused by returning a value in a void function (introduced
in a recent commit by me). Relaxed checks before calling vm_object_remove;
a non-internal object always has a pager.


# f5cf85d4 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed unused fifth argument to vm_object_page_clean(). Fixed bug with
VTEXT not always getting cleared when it is supposed to. Added check to
make sure that vm_object_remove() isn't called with a NULL pager or for
a pager for an OBJ_INTERNAL object (neither of which will be on the hash
list). Clear OBJ_CANPERSIST if we decide to terminate it because of no
resident pages.


# 563128e4 22-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed potential sleep/wakeup race conditional with splhigh().

Submitted by: John Dyson


# 7c1f6ced 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a new boolean argument to vm_object_page_clean that causes it to
only toss out clean pages if TRUE.


# 0426122f 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't gain/lose an object reference in vnode_pager_setsize(). It will
cause vnode locking problems in vm_object_terminate().
Implement proper vnode locking in vm_object_terminate().


# 83edfd47 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed an unnecessary call to vinvalbuf after the page clean.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# c4ed5a07 12-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed obsolete comment.


# 61ca29b0 12-Mar-1995 David Greenman <dg@FreeBSD.org>

Deleted vm_object_setpager().


# be6d5bfa 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't attempt to reverse collapse non OBJ_INTERNAL objects.


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# a7ac758e 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Removed bogus copy object collapse check (the idea is right, but the
spcific check was bogus).
Removed old copy of vm_object_page_clean and took out the #if 1 around
the remaining one.

Submitted by: John Dyson


# c0503609 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 7fb0c17e 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Deprecated remaining use of vm_deallocate. Deprecated vm_allocate_with_
pager(). Almost completely rewrote vm_mmap(); when John gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# ba8da839 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Panic if object is deallocated too many times.
Slight change to reverse collapsing so that vm_object_deallocate doesn't
have to be called recursively.
Removed half of a previous fix - the renamed page during a collapse doesn't
need to be marked dirty because the pager backing store pointers are copied
- thus preserving the page's data. This assumes that pages without backing
store are always dirty (except perhaps for when they are first zeroed, but
this doesn't matter).
Switch order of two lines of code so that the correct pager is removed
from the hash list. The previous code bogusly passed a NULL pointer to
vm_object_remove(). The call to vm_object_remove() should be unnecessary
if named anonymous objects were being dealt with correctly. They are
currently marked as OBJ_INTERNAL, which really screws up things (such as
this).


# 9b4814bb 17-Feb-1995 David Greenman <dg@FreeBSD.org>

1) Added protection against collapsing OBJ_DEAD objects.
2) bump reference counts by 2 instead of 1 so that an object deallocate
doesn't try to recursively collapse the object.
3) mark pages renamed during the collapse as dirty so that their contents
are preserved.

Submitted by: John and me.


# 0217125f 12-Feb-1995 David Greenman <dg@FreeBSD.org>

Carefully choose the value for vm_object_cache_max. The previous calculation
was rather bogus in most cases; the new value works very well for both
large and small memory machines.


# a1f6d91c 02-Feb-1995 David Greenman <dg@FreeBSD.org>

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded it's functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


# a465acda 25-Jan-1995 David Greenman <dg@FreeBSD.org>

Don't attempt to clean device_pager backed objects at terminate time.
There is similar bogusness in the pageout daemon that will be fixed soon.
This fixes a panic pointed out to me by Bruce Evans that occurs when
/dev/mem is used to map managed memory.


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to acheive better paging performance and to
accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# b9921222 13-Jan-1995 David Greenman <dg@FreeBSD.org>

Protect a qcollapse call with an object lock before calling. The locks
need to be moved into the qcollapse and rcollapse routines, but I don't
have time at the moment to make all the required changes...this will do
for now.


# 8b4dd3c4 11-Jan-1995 David Greenman <dg@FreeBSD.org>

Improve my previous change to use the same tests as are used in qcollapse.


# a7489784 11-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed a panic that Garrett reported to me...the OBJ_INTERNAL flag wasn't
being cleared in some cases for vnode backed objects; we now do this in
vnode_pager_alloc proper to guarantee it. Also be more careful in the
rcollapse code about messing with busy/bmapped pages.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 010cf3b9 04-Jan-1995 David Greenman <dg@FreeBSD.org>

Make sure that the object being collapsed doesn't go away on us...by
gaining extra references to it.

Submitted by: John Dyson
Obtained from:


# 45cbbb29 22-Dec-1994 David Greenman <dg@FreeBSD.org>

Do vm_page_rename more conservatively in rcollapse and qcollapse, and
change list walk so that it doesn't get stuck in an infinite loop.

Submitted by: John Dyson


# 7b18a718 10-Dec-1994 David Greenman <dg@FreeBSD.org>

Don't put objects that have no parent on the reverse_shadow_list. Problem
identified and explained by Gene Stark (thanks Gene!).

Submitted by: John Dyson


# eadf9e27 25-Nov-1994 David Greenman <dg@FreeBSD.org>

These changes fix a couple of lingering VM problems:

1. The pageout daemon used to block under certain
circumstances, and we needed to add new functionality
that would cause the pageout daemon to block more often.
Now, the pageout daemon mostly just gets rid of pages
and kills processes when the system is out of swap.
The swapping, rss limiting and object cache trimming
have been folded into a new daemon called "vmdaemon".
This new daemon does things that need to be done for
the VM system, but can block. For example, if the
vmdaemon blocks for memory, the pageout daemon
can take care of it. If the pageout daemon had
blocked for memory, it was difficult to handle
the situation correctly (and in some cases, was
impossible).

2. The collapse problem has now been entirely fixed.
It now appears to be impossible to accumulate unnecessary
vm objects. The object collapsing now occurs when ref counts
drop to one (where it is more likely to be more simple anyway
because less pages would be out on disk.) The original
fixes were incomplete in that pathological circumstances
could still be contrived to cause uncontrolled growth
of swap. Also, the old code still, under steady state
conditions, used more swap space than necessary. When
using the new code, users will generally notice a
significant decrease in swap space usage, and theoretically,
the system should be leaving fewer unused pages around
competing for memory.

Submitted by: John Dyson


# 2fe6e4d7 05-Nov-1994 David Greenman <dg@FreeBSD.org>

Added support for starting the experimental "vmdaemon" system process.
Enabled via REL2_1.

Added support for doing object collapses "on the fly". Enabled via REL2_1a.

Improved object collapses so that they can happen in more cases. Improved
sensing of modified pages to fix an apparant race condition and improve
clustered pageout opportunities. Fixed an "oops" with not restarting page
scan after a potential block in vm_pageout_clean() (not doing this can result
in strange behavior in some cases).

Submitted by: John Dyson & David Greenman


# a08a17a3 15-Oct-1994 David Greenman <dg@FreeBSD.org>

Properly count object lookups and hits.


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 8a129cae 27-Aug-1994 David Greenman <dg@FreeBSD.org>

1) Changed ddb into a option rather than a pseudo-device (use options DDB
in your kernel config now).
2) Added ps ddb function from 1.1.5. Cleaned it up a bit and moved into its
own file.
3) Added \r handing in db_printf.
4) Added missing memory usage stats to statclock().
5) Added dummy function to pseudo_set so it will be emitted if there
are no other pseudo declarations.


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# a481f200 07-Aug-1994 David Greenman <dg@FreeBSD.org>

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources