History log of /freebsd-current/sys/vm/vm_page.h
Revision Date Author Comments
# cb20a74c 03-Apr-2024 Stephen J. Kiernan <stevek@FreeBSD.org>

vm: add macro to mark arguments used when NUMA is defined

This fixes compiler warnings when -Wunused-arguments is enabled and
not quieted.

Reviewed by: kib, markj
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44623


# 0ee1cd6d 23-Dec-2023 Jason A. Harmening <jah@FreeBSD.org>

vm_page.h: tweak page-busied assertion macros

Fix incorrect macro name and include the value of curthread in the
panic message where relevant.


# 2619c5cc 20-Nov-2023 Jason A. Harmening <jah@FreeBSD.org>

Avoid waiting on physical allocations that can't possibly be satisfied

- Change vm_page_reclaim_contig[_domain] to return an errno instead
of a boolean. 0 indicates a successful reclaim, ENOMEM indicates
lack of available memory to reclaim, with any other error (currently
only ERANGE) indicating that reclamation is impossible for the
specified address range. Change all callers to only follow
up with vm_page_wait* in the ENOMEM case.

- Introduce vm_domainset_iter_ignore(), which marks the specified
domain as unavailable for further use by the iterator. Use this
function to ignore domains that can't possibly satisfy a physical
allocation request. Since WAITOK allocations run the iterators
repeatedly, this avoids the possibility of infinitely spinning
in domain iteration if no available domain can satisfy the
allocation request.

PR: 274252
Reported by: kevans
Tested by: kevans
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42706


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 9e817428 16-Jun-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys: add binary segment search

Replace several sequential searches for a segment that contains a
phyiscal address with a call to a function that does it by binary
search. In vm_page_reclaim_contig_domain_ext, find the first segment
to reclaim from, and reclaim from each subsequent appropriate segment.
Eliminate vm_phys_scan_contig.

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D40058


# 8b0dafdb 08-May-2023 Andrew Gallatin <gallatin@FreeBSD.org>

vm: implement vm_page_reclaim_contig_domain_ext()

Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once. This makes it more efficient for users
that need multiple contiguous regions to reclaim those regions
efficiently.

This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RMA). Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.

The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
just seconds or minutes.

Differential Revision: https://reviews.freebsd.org/D39739
Sponsored by: Netflix
Reviewed by: alc, jhb, markj


# 934bfc12 17-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vm_page_any_valid()

Use it and several other vm_page_*_valid() functions in more places.

Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37024


# d950c589 28-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

vm/vm_extern.h, vm/vm_page.h: use sys/kassert.h

instead of fatty sys/systm.h.

Suggested by: jhb
Reviewed by: alc, imp, jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34089


# a2665158 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Remove vm_page_sbusy() and vm_page_xbusy()

They are unused today and cannot be safely used in the face of unlocked
lookup, in which pages may be busied without the object lock held.

Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32948


# 87b64663 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Consolidate page busy sleep mechanisms

- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
instead.
- Remove vm_page_sleep_if_(x)busy().

No functional change intended.

Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32947


# a9d6f1fe 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Remove some remaining references to VM_ALLOC_NOOBJ

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32037


# c40cf9bc 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Stop handling VM_ALLOC_NOOBJ in vm_page_alloc_domain_after()

This makes the allocator simpler since it can assume object != NULL.
Also modify the function to unconditionally preserve PG_ZERO, so
VM_ALLOC_ZERO is effectively ignored (and still must be implemented by
the caller for now).

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32033


# 92db9f3b 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_alloc_noobj_contig()

This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory. For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32005


# b498f71b 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Add a new page allocator interface for unnamed pages

The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.

This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.

Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31985


# 5b10e79e 17-Jun-2021 Konstantin Belousov <kib@FreeBSD.org>

Un-staticise vm_page_init_page()

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30785


# 660344ca 29-Jan-2021 Ryan Stone <rstone@FreeBSD.org>

Add a VM flag to prevent reclaim on a failed contig allocation

If a M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request. On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.

In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.

Sponsored by: Dell Inc
Differential Revision: https://reviews.freebsd.org/D28422
Reviewed by: markj, kib


# 431fb8ab 18-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vm_phys: Try to clean up NUMA KPIs

It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207


# 6f3b523c 14-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Avoid dump_avail[] redefinition.

Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.

Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741


# c2c6fb90 09-Oct-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Use unlocked page lookup for inmem() to avoid object lock contention

Reviewed By: kib, markj
Submitted by: mlaier
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D26653


# 42f96162 08-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

vm_page_dump_index_to_pa(): Add braces to the expression involving + and &.

The precedence of the '&' operator is less than of '+'. Added braces
do change the order of evaluation into the natural one, in my opinion.
On the other hand, the value of the expression should not change since
all elements should have page-aligned values.

This fixes a gcc warning reported.

Reported by: adrian
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 00e66147 21-Sep-2020 D Scott Phillips <scottph@FreeBSD.org>

Sparsify the vm_page_dump bitmap

On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).

Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.

Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131


# ab041f71 21-Sep-2020 D Scott Phillips <scottph@FreeBSD.org>

Move vm_page_dump bitset array definition to MI code

These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.

Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129


# 0292c54b 11-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

Add support for multithreading the inactive queue pageout within a domain.

In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing. Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon. It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems. For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by: jeff (earlier version)
Discussed with: markj
Tested by: pho
Sponsored by: probably Netflix (based on contemporary commits)
Differential Revision: https://reviews.freebsd.org/D21629


# efec381d 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove most lingering references to the page lock in comments.

Finish updating comments to reflect new locking protocols introduced
over the past year. In particular, vm_page_lock is now effectively
unused.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25868


# 958d8f52 29-Jul-2020 Mark Johnston <markj@FreeBSD.org>

Remove the volatile qualifier from busy_lock.

Use atomic(9) to load the lock state. Some places were doing this
already, so it was inconsistent. In initialization code, the lock state
is still initialized with plain stores.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25861


# f72e5be5 28-Jul-2020 Mark Johnston <markj@FreeBSD.org>

vm_page_xbusy_claim(): Use atomics to update busy lock state.

vm_page_xbusy_claim() could clobber the waiter bit. For its original
use, kernel memory pages, this was not a problem since nothing would
ever block on the busy lock for such pages. r363607 introduced a new
use where this could in principle be a problem.

Fix the problem by using atomic_cmpset to update the lock owner. Since
this macro is defined only for INVARIANTS kernels the extra overhead
doesn't seem prohibitive.

Reported by: vangyzen
Reviewed by: alc, kib, vangyzen
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25859


# 4dfa06e1 17-Jul-2020 Chuck Silvers <chs@FreeBSD.org>

Add a new function vm_page_free_invalid() for freeing invalid pages
that might be wired. If the page is wired then it cannot be freed now,
but the thread that eventually unwires it will free it at that point.

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25430


# ffc568ba 09-Jul-2020 Scott Long <scottl@FreeBSD.org>

Revert r362998, r326999 while a better compatibility strategy is devised.


# b302c2e5 07-Jul-2020 Scott Long <scottl@FreeBSD.org>

Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after: 1 week


# a9ea09e5 28-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Re-check for wirings after busying the page in vm_page_release_locked().

A concurrent unlocked lookup can wire the page after
vm_page_release_locked() releases the last wiring, in which case
vm_page_release_locked() must not free the page. Once the xbusy lock is
acquired, that, the object lock and the fact that the page is unmapped
ensure that the wire count cannot increase, so re-check for new wirings
after the page is xbusied.

Update the comment above vm_page_wired() to reflect the new
synchronization rules.

Reported by: glebius
Reviewed by: alc, jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24592


# 6be21eb7 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23866


# c49be4f1 26-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add unlocked grab* function variants that use lockless radix code to
lookup pages. These variants will fall back to their locked counterparts
if the page is not present.

Discussed with: kib, markj
Differential Revision: https://reviews.freebsd.org/D23449


# e9ceb9dd 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't release xbusy on kmem pages. After lockless page lookup we will not
be able to guarantee that they can be racquired without blocking.

Reviewed by: kib
Discussed with: markj
Differential Revision: https://reviews.freebsd.org/D23506


# f212367b 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Refactor _vm_page_busy_sleep to reduce the delta between the various
sleep routines and introduce a variant that supports lockless sleep.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23612


# ee9e43f8 04-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add an explicit busy state for free pages. This improves behavior with
potential bugs that access freed pages as well as providing a path
towards lockless page lookup.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23444


# f0a273c0 01-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Remove a couple of lingering usages of the page lock.

Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using
vm_page_change_lock(). It has no use after r356157. Remove
vm_page_change_lock() now that it has no users.

Remove an unncessary check for wirings in vm_page_scan_contig(), which
was previously checking twice. The check is racy until
vm_page_reclaim_run() ensures that the page is unmapped, so one check is
sufficient.

Reviewed by: jeff, kib (previous versions)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23279


# 727150ff 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove some unused functions.

The previous series of patches orphaned some vm_page functions, so
remove them.

Reviewed by: dougm, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22886


# dc71caa0 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Update the vm_page.h block comment to reflect recent changes.

Explain the new locking rules for per-page queue state updates.

Reviewed by: jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22884


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# f3f38e25 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Start implementing queue state updates using fcmpset loops.

This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.

Introduce the vm_page_pqstate_commit_*() functions. These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations. vm_page_pqstate_commit() wraps these
functions.

Convert a number of routines to use these new helpers. Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.

Convert the page queue scans to use the new helpers. Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22770
Differential Revision: https://reviews.freebsd.org/D22771
Differential Revision: https://reviews.freebsd.org/D22772
Differential Revision: https://reviews.freebsd.org/D22773
Differential Revision: https://reviews.freebsd.org/D22776


# 3cf3b4e6 21-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Make page busy state deterministic on free. Pages must be xbusy when
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.

Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.

Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822


# c2f22e97 17-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Fix the aflag shift on big-endian platforms after r355672.

The structure offset is zero regardless of endianness.

Reported by: brooks
Pointy hat: markj


# a8081778 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654


# cbc080b4 12-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Avoid relying on silent type casting in the native atomic_load_32.

Reported by: np


# 6fbaf685 12-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Implement atomic state updates using the new vm_page_astate_t structure.

Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset()
to operate on the 32-bit per-page atomic state. Modify
vm_page_pqstate_fcmpset() to use them. No functional change intended.

Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only
includes queue operation flags. This will be used in subsequent
patches.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22753


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650


# fb1d575c 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Reduce duplication in grab functions by providing allocflags based inlines.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22635


# 584061b4 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564


# b631c36f 24-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Record part of the owner struct thread pointer into busy_lock.

Record as much bits from curthread into busy_lock as fits. Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0). Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler. For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22298


# 7f935055 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary object locking from the vnode pager. Recent changes to
busy/valid/dirty locking make these acquires redundant.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22186


# 3a2ba997 18-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Widen the vm_page aflags field to 16 bits.

We are now out of aflags bits, whereas the "flags" field only makes use
of five of its sixteen bits, so narrow "flags" to eight bits. I have no
intention of adding a new aflag in the near future, but would like to
combine the aflags, queue and act_count fields into a single atomically
updated word. This will allow vm_page_pqstate_cmpset() to become much
simpler and is a step towards eliminating the use of the page lock array
in updating per-page queue state.

The change modifies the layout of struct vm_page, so bump
__FreeBSD_version.

Reviewed by: alc, dougm, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22397


# 01cef4ca 16-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking from pmap_mincore().

After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823


# fff5403f 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the
object lock in vm_page_set_validclean().

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21595


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 205be21d 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trival to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# b119329d 25-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Complete the removal of the "wire_count" field from struct vm_page.

Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.

Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768


# 38547d59 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Assert that the refcount value is not VPRC_BLOCKED in vm_page_drop().

VPRC_BLOCKED is a refcount flag used to indicate that a thread is
tearing down mappings of a page. When set, it causes attempts to wire a
page via a pmap lookup to fail. It should never represent the last
reference to a page, so assert this.

Suggested by: kib
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# 923da43e7 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_dequeue_deferred_free() after r352110.

This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index. Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.

Reported and tested by: pho
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# c7575748 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.

Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546


# 4cdea4a8 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchonization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486


# 7cdeaf33 03-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Add preliminary support for atomic updates of per-page queue state.

Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page. Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon. In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks. However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required. vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset(). Use it in the active queue scan, which
does not use the object lock. Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by: kib
Discussed with: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21257


# 386eba08 23-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Make vm_pqbatch_submit_page() externally visible.

It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21369


# 8b90607f 21-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Simplify vm_page_dequeue() and fix an assertion.

- Add a vm_pagequeue_remove() function to physically remove a page
from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
from the page allocator with the page unlocked. The assertion
m->order == VM_NFREEORDER simply tests whether the page has been
removed from the vm_phys free lists; instead, check whether the
page belongs to an object.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21341


# 98549e2d 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().

Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247


# d9a73522 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Add a per-CPU page cache per VM free pool.

Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by: dougm, kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20858


# 9f74cdbf 02-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Mark pages allocated from the per-CPU cache.

Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763


# 0fd977b3 26-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a return value to vm_page_remove().

Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758


# d842aa51 01-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a vm_page_wired() predicate.

Use it instead of accessing the wire_count field directly. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485


# 1e2b3e6f 03-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Allow vm_page_free_prep() to dequeue pages without the page lock.

This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.

No functional change intended.

Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065


# 21f01f45 06-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Remove vm_page_remque().

Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276. As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by: kib
Tested by: pho
Approved by: re (marius)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17025


# 49bfa624 25-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845


# 99d92d73 23-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Ensure that queue state is cleared when vm_page_dequeue() returns.

Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held. When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held. Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15980


# f4b36404 02-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

inline atomics and allow tied modules to inline locks

- inline atomics in modules on i386 and amd64 (they were always
inline on other arches)
- allow modules to opt in to inlining locks by specifying
MODULE_TIED=1 in the makefile

Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079


# ba2b3349 16-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_pagequeue_lockptr().

The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.

Reported by: glebius
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15462


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893


# 64b38930 19-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Initialize marker pages in vm_page_domain_init().

They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.

Reported and tested by: cperciva
Reviewed by: alc, kib
MFC after: 1 week


# e55d32b7 07-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Handle Skylake-X errata SKZ63.

SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region

Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.

Move the pages from the range into the blacklist. Add a tunable to
not waste 4M if local DoS is not the issue.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15001


# 8c8ee2ee 04-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Unify bulk free operations in several pmaps.

Submitted by: Yoshihiro Ota
Reviewed by: markj
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13485


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 4d265352 06-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Delete a declaration for a variable removed in r305362.


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 796df753 30-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

SPDX: Consider code from Carnegie-Mellon University.

Interesting cases, most likely from CMU Mach sources.


# ef435ae7 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Move domain iterators into the page layer where domain selection should take
place. This makes the majority of the phys layer explicitly domain specific.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix & Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13014


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 494c6e43 24-Sep-2017 Alan Cox <alc@FreeBSD.org>

Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all()
can be avoided when the page's containing object has a reference count of
zero. (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)

Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".

Reviewed by: kib, markj
MFC after: 1 week


# 540ac3b3 13-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Split vm_page_free_toq() into two parts, preparation vm_page_free_prep()
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.

Reviewed by: alc, markj
Tested by: mjg (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# b9e8fb64 13-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Use existing tag name for the vm_object' memq.

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 93c5d3a4 09-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Add a vm_page_change_lock() helper, the common code to not relock page
lock if both old and new pages use the same underlying lock. Convert
existing places to use the helper instead of inlining it. Use the
optimization in vm_object_page_remove().

Suggested and reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 33fff5d5 15-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Add vm_page_alloc_after().

This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.

Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().

Suggested by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11984


# 9df950b3 11-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.

This will allow its use in sendfile_swapin().

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11942


# 5471caf6 08-Aug-2017 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926


# 88302601 13-Jul-2017 Alan Cox <alc@FreeBSD.org>

Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by: kib, markj
MFC after: 1 week


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# bfc8c24c 04-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Move bogus_page declaration to vm_page.h and initialization to vm_page.c.

Reviewed by: kib


# b1fd102e 02-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Add a page queue for holding dirty anonymous unswappable pages.

On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by: alc


# 3453bca8 12-Dec-2016 Alan Cox <alc@FreeBSD.org>

Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# ebcddc72 09-Nov-2016 Alan Cox <alc@FreeBSD.org>

Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302


# bd9546a2 17-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Export vm_page_xunbusy_maybelocked().

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D8197


# 5975e53d 13-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix a race in vm_page_busy_sleep(9).

Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196


# 70cf3ced 05-Oct-2016 Alan Cox <alc@FreeBSD.org>

Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker(). There is no reason
to store the pass in the NUMA domain structure.

Reviewed by: kib
MFC after: 3 weeks


# dbbaf04f 03-Sep-2016 Mark Johnston <markj@FreeBSD.org>

Remove support for idle page zeroing.

Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714


# 505cd5d1 23-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Add a comment noting locking regime for vm_page_xunbusy().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 5a2e650a 19-May-2016 Conrad Meyer <cem@FreeBSD.org>

vm/vm_page.h: Fix trivial '-Wpointer-sign' warning

pq_vcnt, as a count of real things, has no business being negative. It is only
ever initialized by a u_int counter.

The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().

Rectify the warning by changing the variable to u_int. No functional change.

Suggested by: Clang 3.3
Sponsored by: EMC / Isilon Storage Division


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# c869e672 19-Dec-2015 Alan Cox <alc@FreeBSD.org>

Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
memory allocator's free page lists. This replaces the long-standing
approach of scanning the inactive and inactive queues, converting clean
pages into PG_CACHED pages and laundering dirty pages. In contrast, the
new mechanism does not use PG_CACHED pages nor does it trigger a large
number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
free pages in the physical memory allocator's free page lists that are
covered by the direct map. Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places. For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by: kib, markj
Glanced at: jhb
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4444


# 8170d6e5 17-Dec-2015 Conrad Meyer <cem@FreeBSD.org>

vm_page_replace: add wrapper to KASSERT about old page

It turns out the callers of vm_page_replace know exactly which page they are
replacing and would like to assert about it. Change those from hard panics to
KASSERTs, and provide them with a wrapper so they don't have to deal with
warnings from an INVARIANTS-dependent dead store of the return value of
vm_page_replace.

Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib (earlier version)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4497


# dc62d559 16-Dec-2015 Conrad Meyer <cem@FreeBSD.org>

vm_page.h: page busy macro fixups

Minor changes to:
- delete extraneous trailing semicolons from macro definitions, and
- correct spelling of "busying" in panic messages

Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4577


# 76386c7e 15-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Rework the test which raises OOM condition. Right now, the code
checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes. This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately. E.g.,
on the swap-less systems, large active queue easily triggered OOM.

Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes. Track the failed passes per
pagedaemon thread.

The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations. If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised. Each pass takes 1/2 of seconds. Less the
value, more sensible the pagedaemon is to the page shortage.

In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load. But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed. Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning. Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.

Reported by: Dmitry Sivachenko, bde, many others
Tested by: pho, bde, tuexen (arm)
Reviewed by: alc
Discussed with: bde, imp
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 7e78597f 07-Nov-2015 Mark Johnston <markj@FreeBSD.org>

Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon. Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.

Reviewed by: alc
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division


# 7c989c15 22-Oct-2015 Jason A. Harmening <jah@FreeBSD.org>

Fix capitalization


# a5073058 22-Oct-2015 Jason A. Harmening <jah@FreeBSD.org>

Remove unclear comment about address truncation in busdma. Add (hopefully much clearer) comment at declaration of PHYS_TO_VM_PAGE().

Noted by: avg


# 3138cd36 30-Sep-2015 Mark Johnston <markj@FreeBSD.org>

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 15aaea78 22-Sep-2015 Alan Cox <alc@FreeBSD.org>

Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 22cf98d1 08-Jul-2015 Alan Cox <alc@FreeBSD.org>

The intention of r254304 was to scan the active queue continuously.
However, I've observed the active queue scan stopping when there are
frequent free page shortages and the inactive queue is steadily refilled
by other mechanisms, such as the sequential access heuristic in vm_fault()
or madvise(2). To remedy this problem, record the time of the last active
queue scan, and always scan a number of pages proportional to the time
since the last scan, regardless of whether that last scan was a
timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass >
0") scan.

Also, on a timeout-triggered scan, allow a full scan of the active queue
when the system is short of inactive pages.

Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# e3ed82bc 22-Dec-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and
allows the function to fail.

Reviewed by: kib, alc
Sponsored by: Nginx, Inc.


# 89fc8bdb 22-Dec-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Document flags of vm_page allocation functions.

Reviewed by: alc


# afb69e6b 08-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of
pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set
when either the page is busied, or the owner object is locked.

Update comments, move all assertions about page state when
PGA_WRITEABLE flag is set, into new helper
vm_page_assert_pga_writeable().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# dd05fa19 07-Jun-2014 Alan Cox <alc@FreeBSD.org>

Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping. For
example, both kinds of mappings entail the creation of a single PTE and PV
entry. With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages. Now, it will
create up to 96 base page or superpage mappings.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 000fb817 31-Dec-2013 Alan Cox <alc@FreeBSD.org>

Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held. Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags. Moreover,
the PG_FREE flag can be retired. Now that the reservation system no longer
uses it, its only uses are in a few assertions. Eliminating these
assertions is no real loss. Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by: EMC / Isilon Storage Division


# 9eab5484 17-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


# 3846a822 16-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# d9e23210 13-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wakeup before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period pages every second. This causes us to visit
the whole active list every 60 seconds. Previously we would only maintain
the active LRU when we were short on pages which would mean it could be
woefully out of date.

Reviewed by: alc (slight variant of this)
Discussed with: alc, kib, jhb
Sponsored by: EMC / Isilon Storage Division


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# cdc00bf7 09-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by: alc


# 5d82a214 09-Aug-2013 David E. O'Brien <obrien@FreeBSD.org>

Add missing 'VPO_BUSY' from r254141 to fix kernel build break.


# e946b949 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 449c2e92 07-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 2051980f 09-Jun-2013 Alan Cox <alc@FreeBSD.org>

Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by: EMC / Isilon Storage Division


# da384208 03-Jun-2013 Alan Cox <alc@FreeBSD.org>

Update a comment.


# b4171812 02-Jun-2013 Alan Cox <alc@FreeBSD.org>

Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag. Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag. So, in fact, this revision only changes some
comments and an assertion. Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# ef5ba5a3 31-May-2013 Alan Cox <alc@FreeBSD.org>

Simplify the definition of vm_page_lock_assert(). There is no compelling
reason to inline the implementation of vm_page_lock_assert() in the
!KLD_MODULES case. Use the same implementation for both KLD_MODULES and
!KLD_MODULES.

Reviewed by: kib


# a15f7df5 08-Apr-2013 Attilio Rao <attilio@FreeBSD.org>

The per-page act_count can be made very-easily protected by the
per-page lock rather than vm_object lock, without any further overhead.
Make the formal switch.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 774d251d 17-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 969a0af0 16-Nov-2012 Alan Cox <alc@FreeBSD.org>

Update a comment to reflect the elimination of the hold queue in r242300.


# 43f48b65 15-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by: alc
MFC after: 2 weeks


# 962b064a 15-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Explicitely state that M_USE_RESERVE requires M_NOWAIT, using assertion.

Reviewed by: alc
MFC after: 2 weeks


# b32ecf44 14-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by: alc, mdf (previous version)
Tested by: pho (previous version)
MFC after: 2 weeks


# 8d220203 12-Nov-2012 Alan Cox <alc@FreeBSD.org>

Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by: kib


# 4ceaf45d 31-Oct-2012 Attilio Rao <attilio@FreeBSD.org>

Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by: jimharris, alc


# 081a4881 29-Oct-2012 Alan Cox <alc@FreeBSD.org>

Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose. When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is
used as a flag, not a queue. So, this change replaces it with a flag.

To accomodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by: kib


# 7ecfabc7 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Move vm_page_requeue() to the only file that uses it.

MFC after: 3 weeks


# b6c00483 14-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


# 1c771f92 05-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


# 0055cbd3 04-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


# 369763e3 02-Aug-2012 Alan Cox <alc@FreeBSD.org>

Inline vm_page_aflags_clear() and vm_page_aflags_set().

Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.


# d26a90a8 30-Jul-2012 Alan Cox <alc@FreeBSD.org>

Eliminate an unneeded declaration. (I should have removed this as part
of r227568.)


# eddc9291 20-Jun-2012 Alan Cox <alc@FreeBSD.org>

Selectively inline vm_page_dirty().


# 6031c68d 16-Jun-2012 Alan Cox <alc@FreeBSD.org>

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


# b6de32bd 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
were PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# e461aae7 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Split the code from vm_page_getfake() to initialize the fake page struct
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 13a0b7bc 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Commit the change forgotten in r235356.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 1c8279e4 08-Apr-2012 Alan Cox <alc@FreeBSD.org>

Fix mincore(2) so that it reports PG_CACHED pages as resident.

MFC after: 2 weeks


# d1aa86e1 06-Apr-2012 Attilio Rao <attilio@FreeBSD.org>

Staticize vm_page_cache_remove().

Reviewed by: alc


# 57bd5cce 06-Apr-2012 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction
caches, by invalidating kernel icaches only when needed and not flushing
user caches for shared pages.

Suggested by: kib
MFC after: 2 weeks


# 263811f7 27-Jan-2012 Kip Macy <kmacy@FreeBSD.org>

exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create

Reviewed by: alc, avg
MFC after: 2 weeks


# dc874f98 30-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


# cf1911a9 29-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Hide the internals of vm_page_lock(9) from the loadable modules.
Since the address of vm_page lock mutex depends on the kernel options,
it is easy for module to get out of sync with the kernel.

No vm_page_lockptr() accessor is provided for modules. It can be added
later if needed, unless proper KPI is developed to serve the needs.

Reviewed by: attilio, alc
MFC after: 3 weeks


# fbd80bd0 16-Nov-2011 Alan Cox <alc@FreeBSD.org>

Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.

Reviewed by: kib
Discussed with: jhb


# 7845becf 05-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Remove redundand definitions. The chunk was missed from r227102.

MFC after: 2 weeks


# 561cc9fc 05-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Provide typedefs for the type of bit mask for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independend VM layer.

Reviewed by: alc
MFC after: 2 weeks


# 2042bb37 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix grammar.

Submitted by: bf
MFC after: 2 weeks


# abb9b935 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the trick of performing the atomic operation on the contained aligned
word to handle the dirty mask updates in vm_page_clear_dirty_mask().
Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold()
the sole purpose of which was to protect dirty on architectures which
does not provide short or byte-wide atomics.

Reviewed by: alc, attilio
Tested by: flo (sparc64)
MFC after: 2 weeks


# 005f6091 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the explicitly-sized types for the dirty and valid masks.

Requested by: attilio
Reviewed by: alc
MFC after: 2 weeks


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# d98d0ce2 09-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use use just
VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)


# 3c76db4c 19-Jun-2011 Alan Cox <alc@FreeBSD.org>

Precisely document the synchronization rules for the page's dirty field.
(Saying that the lock on the object that the page belongs to must be held
only represents one aspect of the rules.)

Eliminate the use of the page queues lock for atomically performing read-
modify-write operations on the dirty field when the underlying architecture
supports atomic operations on char and short types.

Document the fact that 32KB pages aren't really supported.

Reviewed by: attilio, kib


# 3b1025d2 11-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

Assert that page is VPO_BUSY or page owner object is locked in
vm_page_undirty(). The assert is not precise due to VPO_BUSY owner
to tracked, so assertion does not catch the case when VPO_BUSY is
owned by other thread.

Reviewed by: alc


# 10cf2560 11-Mar-2011 Alan Cox <alc@FreeBSD.org>

Eliminate duplication of the fake page code and zone by the device and sg
pagers.

Reviewed by: jhb


# 44e46b9e 17-Jan-2011 Alan Cox <alc@FreeBSD.org>

Explicitly initialize the page's queue field to PQ_NONE instead of relying
on PQ_NONE being zero.

Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for
PQ_NONE.

Reviewed by: kib@


# 43319c11 16-Jan-2011 Alan Cox <alc@FreeBSD.org>

Update a lock annotation on the page structure.


# 4c6a2e7a 16-Jan-2011 Alan Cox <alc@FreeBSD.org>

Shift responsibility for synchronizing access to the page's act_count
field to the object's lock.

Reviewed by: kib@


# 8c22654d 17-Dec-2010 Alan Cox <alc@FreeBSD.org>

Implement and use a single optimized function for unholding a set of pages.

Reviewed by: kib@


# aa546366 27-Nov-2010 Jayachandran C. <jchandra@FreeBSD.org>

Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by: alc


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 49ca10d4 21-Jul-2010 Jayachandran C. <jchandra@FreeBSD.org>

Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by: alc


# b99348e5 09-Jul-2010 Alan Cox <alc@FreeBSD.org>

Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit. Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by: kib


# 1d9e77f6 08-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by: alc


# 5f195aa3 05-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by: alc
MFC after: 2 weeks


# e239bb97 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by: alc
Tested by: pho


# b382c10a 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# 9cf51988 02-Jul-2010 Alan Cox <alc@FreeBSD.org>

With the demise of page coloring, the page queue macros no longer serve any
useful purpose. Eliminate them.

Reviewed by: kib


# 91b4f427 21-Jun-2010 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean(). When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by: kib@
MFC after: 3 weeks


# ce186587 10-Jun-2010 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines. Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked. Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick(). (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page. Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete. Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used. Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by: kib@


# 8f0d5d3b 29-May-2010 Alan Cox <alc@FreeBSD.org>

When I pushed down the page queues lock into pmap_is_modified(), I created
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first. Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.

When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter(). In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page. (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.) This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.

Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping. Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.

There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.

Tidy up the variable declarations at the start of pmap_enter().


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 9ab6032f 16-May-2010 Alan Cox <alc@FreeBSD.org>

On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy. Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try. Previously, the call to pmap_remove_write()
would have failed silently.


# b2775308 10-May-2010 Alan Cox <alc@FreeBSD.org>

Update synchronization annotations for struct vm_page. Add a comment
explaining how the setting of PG_WRITEABLE is synchronized.


# c1f98960 07-May-2010 Alan Cox <alc@FreeBSD.org>

Update the synchronization requirements for the page usage count.


# fd8c28bf 06-May-2010 Alan Cox <alc@FreeBSD.org>

Update a comment to say that access to a page's wire count is now
synchronized by the page lock.


# 7024db1d 06-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock inside of vm_page_free_toq() and
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free(). Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.

Update vm_page_unhold() to reflect the above change.


# 5ac59343 05-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# 0ce3ba8c 30-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

Update locking comment above vm_page:
- re-assign page queue lock "Q"
- assign page lock "P"
- update several uncommented fields
- observe that hold_count is now protected by the page lock "P"


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# e67e0775 04-Oct-2009 Alan Cox <alc@FreeBSD.org>

Align and pad the page queue and free page queue locks so that the linker
can't possibly place them together within the same cache line.

MFC after: 3 weeks


# 461c7860 30-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate a stale comment and the two remaining uses of the "register"
keyword in this file.


# 1c1b26f2 12-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by: tegge


# 641e2829 03-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# a8a478fc 26-Sep-2008 Ed Maste <emaste@FreeBSD.org>

Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.


# e5b006ff 19-Mar-2008 Alan Cox <alc@FreeBSD.org>

Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent
migration to vm/vm_page.c.


# 1fa94a36 18-Mar-2008 Alan Cox <alc@FreeBSD.org>

Almost seven years ago, vm/vm_page.c was split into three parts:
vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c
has withered to the point that it contains only four short functions,
two of which are only used by vm/vm_page.c. Since I can't foresee any
reason for vm/vm_pageq.c to grow, it is time to fold the remaining
contents of vm/vm_pageq.c back into vm/vm_page.c.

Add some comments. Rename one of the functions, vm_pageq_enqueue(),
that is now static within vm/vm_page.c to vm_page_enqueue().
Eliminate PQ_MAXCOUNT as it no longer serves any purpose.


# c9444914 26-Sep-2007 Alan Cox <alc@FreeBSD.org>

Correct an error of omission in the reimplementation of the page
cache: vm_object_page_remove() should convert any cached pages that
fall with the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.

Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.

Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 0f752392 12-Jul-2007 Alan Cox <alc@FreeBSD.org>

Update a comment describing the page queues.

Approved by: re (hrs)


# 2446e4f0 15-Jun-2007 Alan Cox <alc@FreeBSD.org>

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re


# 04a18977 05-May-2007 Alan Cox <alc@FreeBSD.org>

Define every architecture as either VM_PHYSSEG_DENSE or
VM_PHYSSEG_SPARSE depending on whether the physical address space is
densely or sparsely populated with memory. The effect of this
definition is to determine which of two implementations of
vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy
implementation is obtained by defining VM_PHYSSEG_DENSE, and a new
implementation that trades off time for space is obtained by defining
VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and
sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64
allows the entirety of my Itanium 2's memory to be used. Previously,
only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on
sparc64 allows USIIIi-based systems to boot without crashing.

This change is a combination of Nathan Whitehorn's patch and my own
work in perforce.

Discussed with: kmacy, marius, Nathan Whitehorn
PR: 112194


# 9f5c801b 24-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to a OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage(). It is no longer used.


# 0cd31a0d 21-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the page's CLEANCHK flag from being a page queue mutex synchronized
flag to a vm object mutex synchronized flag.


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# e1cb7bc0 03-Sep-2006 Alan Cox <alc@FreeBSD.org>

Make vm_page_release_contig() static.


# eb4bbba8 27-Aug-2006 Alan Cox <alc@FreeBSD.org>

Refactor vm_page_sleep_if_busy() so that the test for a busy page is
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep. In this case, inlining the test actually makes the
kernel smaller.


# 09ef0d6e 24-Aug-2006 Alan Cox <alc@FreeBSD.org>

The return value from vm_pageq_add_new_page() is not used. Eliminate it.


# b146f9e5 12-Aug-2006 Alan Cox <alc@FreeBSD.org>

Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 39fd9b63 24-Jan-2006 Alan Cox <alc@FreeBSD.org>

With the recent changes to the implementation of page coloring, the
the option PQ_NOOPT is used exclusively by vm_pageq.c. Thus, the
include of opt_vmpage.h can be removed from vm_page.h.


# ef39c05b 31-Dec-2005 Alexander Leidinger <netchild@FreeBSD.org>

MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 8be9063e 04-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Don't perform a nested include of opt_vmpage.h if LIBMEMSTAT is defined,
as opt_vmpage.h will not be available to user space library builds. A
similar existing check is present for KLD_MODULE for similar reasons.

MFC after: 3 days


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 91f7a860 26-Dec-2004 Alan Cox <alc@FreeBSD.org>

Note that access to the page's busy count is synchronized by the containing
object's lock.


# 0f9f9bcb 24-Oct-2004 Alan Cox <alc@FreeBSD.org>

Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab()
that indicates that the caller does not want a page with its busy flag set.
In many places, the global page queues lock is acquired and released just
to clear the busy flag on a just allocated page. Both the allocation of
the page and the clearing of the busy flag occur while the containing vm
object is locked. So, the busy flag might as well never be set.


# 2bbb534b 22-Aug-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Move the cow field between wire_count and hold_count. This is the
position that is 64-bit aligned and makes sure that the valid and
dirty fields are also 64-bit aligned. This means that if PAGE_SIZE
is 32K, the size of the vm_page structure is only increased by 8
bytes instead of 16 bytes. More importantly, the vm_page structure
is either 120 or 128 bytes on ia64. These are "interesting" sizes.


# 4362fada 19-Jul-2004 Brian Feldman <green@FreeBSD.org>

Reimplement contigmalloc(9) with an algorithm which stands a greatly-
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.

The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less. The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block. If this is not enough, the subsequent passes
page out any unwired memory. To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.

The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed. Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.

There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used. Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month. It is safe to switch at
run-time to see the difference it makes.

A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig(). These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.

When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats. Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well. This invalidates the BUGS section of the
contigmalloc(9) manpage.


# 69c1a910 05-Jun-2004 Alan Cox <alc@FreeBSD.org>

Update stale comments regarding page coloring.


# 62326de7 03-Jun-2004 Alan Cox <alc@FreeBSD.org>

Move the definitions of SWAPBLK_NONE and SWAPBLK_MASK from vm_page.h to
blist.h, enabling the removal of numerous #includes from subr_blist.c.
(subr_blist.c and swap_pager.c are the only users of these definitions.)


# e3637856 30-May-2004 Alan Cox <alc@FreeBSD.org>

Remove a stale comment: PG_DIRTY and PG_FILLED were removed in
revisions 1.17 and 1.12 respectively.


# 05eb3785 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 889eb0fc 04-Apr-2004 Alan Cox <alc@FreeBSD.org>

Eliminate unused arguments from vm_page_startup().


# 45ad3d59 03-Mar-2004 Alan Cox <alc@FreeBSD.org>

Remove some long unused definitions.


# 93dbd071 25-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Align a comment within struct vm_page.
- Annotate the vm_page's valid field as synchronized by the containing
vm object's lock.


# ab42316c 22-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from
vm_pageout_scan(). Rationale: I don't like leaving a busy page in the
cache queue with neither the vm object nor the vm page queues lock held.
- Assert that the page is active in vm_pageout_page_stats().


# fee181a6 20-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Remove some long unused code.


# 669890ea 07-Oct-2003 Alan Cox <alc@FreeBSD.org>

Retire vm_page_copy(). Its reason for being ended when peter@ modified
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.


# 16bc6ff3 25-Aug-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Assert that u_long is at least 64 bits if PAGE_SIZE is 32K.

Suggested by: phk


# 21a708cf 23-Aug-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Also define VM_PAGE_BITS_ALL for 16K and 32K pages. Make the constant
unsigned for all page sizes and unsigned long for 32K pages.


# 1fa057c6 23-Aug-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Add support for 16K and 32K page sizes. The valid and dirty maps
in struct vm_page are defined as u_int for 16K pages and u_long
for 32K pages, with the implied assumption that long will at least
be 64 bits wide on platforms where we support 32K pages.


# 227f9a1c 24-Mar-2003 Jake Burkholder <jake@FreeBSD.org>

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# 24c9ad6b 19-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Remove vm_page_sleep_busy(). The transition to vm_page_sleep_if_busy(),
which incorporates page queue and field locking, is complete.
- Assert that the page queue lock rather than Giant is held in
vm_page_flag_set().


# a12cc0e4 17-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# ada2a050 04-Nov-2002 Alan Cox <alc@FreeBSD.org>

Export the function vm_page_splay().


# 026aa839 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add a new flag to vm_page_alloc, VM_ALLOC_NOOBJ. This tells
vm_page_alloc not to insert this page into an object. The pindex is
still used for colorization.
- Rework vm_page_select_* to accept a color instead of an object and
pindex to work with VM_PAGE_NOOBJ.
- Document other VM_ALLOC_ flags.

Reviewed by: peter, jake


# f3b676f0 20-Oct-2002 Alan Cox <alc@FreeBSD.org>

o Reinline vm_page_undirty(), reducing the kernel size. (This reverts
a part of vm_page.h revision 1.87 and vm_page.c revision 1.167.)


# b86ec922 18-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re


# 99571dc3 18-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.


# fff6062a 24-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


# 38f612e0 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Remove the setting and clearing of the PG_MAPPED flag from the alpha and
ia64 pmap.
o Remove the PG_MAPPED flag's declaration.


# e5f8bd94 29-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Introduce vm_page_sleep_if_busy() as an eventual replacement for
vm_page_sleep_busy(). vm_page_sleep_if_busy() uses the page
queues lock.


# 2c071f61 28-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Modify vm_page_grab() to accept VM_ALLOC_WIRED.


# 6fd77192 19-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Remove dead and/or unused code.


# 827b2fa0 17-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Introduce an argument, VM_ALLOC_WIRED, that requests vm_page_alloc()
to return a wired page.
o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because
Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical
section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only
VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.)
o Assert that the page queues mutex is held in vm_page_wire()
on Alpha, just like the other platforms.


# 1f545269 13-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Complete the locking of page queue accesses by vm_page_unwire().
o Assert that the page queues lock is held in vm_page_unwire().
o Make vm_page_lock_queues() and vm_page_unlock_queues() visible
to kernel loadable modules.


# 70c17636 04-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free
queue lock (revision 1.33 of vm/vm_page.c removed them).
o Make the free queue lock a spin lock because it's sometimes acquired
inside of a critical section.


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# e78f35b3 25-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Turn VM_ALLOC_ZERO into a flag.

Submitted by: tegge
Reviewed by: dillon


# 8f2ba19c 27-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove unused #defines.


# 44e74ba6 27-Apr-2002 Peter Wemm <peter@FreeBSD.org>

We do not necessarily need to map/unmap pages to zero parts of them.
On systems where physical memory is also direct mapped (alpha, sparc,
ia64 etc) this is slightly harmful.


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# 2be21c5e 04-Mar-2002 Alan Cox <alc@FreeBSD.org>

o Create vm_pageq_enqueue() to encapsulate code that is duplicated time
and again in vm_page.c and vm_pageq.c.
o Delete unusused prototypes. (Mainly a result of the earlier renaming
of various functions from vm_page_*() to vm_pageq_*().)


# 57145770 02-Mar-2002 Alan Cox <alc@FreeBSD.org>

Remove some long dead code.


# d2760948 19-Feb-2002 Tor Egge <tegge@FreeBSD.org>

Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold
count that would otherwise be on one of the free queues. This eliminates a
panic when broken programs unmap memory that still has pending IO from raw
devices.

Reviewed by: dillon, alc


# 3516c025 24-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Implement idle zeroing of pages. I've been tinkering with this
on and off since John Dyson left his work-in-progress.

It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on.

There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we wont preempt otherwise.

This is basically Matt's algorithm [with hysteresis] with an idle process
to call it in a similar way it used to be called from the idle loop.

I cleaned up the includes a fair bit here too.


# 3a9b5daf 30-Jul-2001 Jake Burkholder <jake@FreeBSD.org>

Oops. Last commit to vm_object.c should have got these files too.

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


# d3e5863f 22-Jul-2001 Assar Westerlund <assar@FreeBSD.org>

make vm_page_select_cache static

Requested by: bde


# 0379d763 21-Jul-2001 Assar Westerlund <assar@FreeBSD.org>

(vm_page_select_cache): add prototype


# 6d03d577 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and aquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


# 1b40f8c0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# ac8f990b 24-May-2001 Matthew Dillon <dillon@FreeBSD.org>

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# ee2fd038 25-Aug-2000 David E. O'Brien <obrien@FreeBSD.org>

Make the arguments match the functionality of the functions.


# 4c8545c1 11-Jul-2000 Alfred Perlstein <alfred@FreeBSD.org>

#elsif -> #elif

Noticed by: green


# 9a20f99a 04-Jul-2000 John Baldwin <jhb@FreeBSD.org>

Replace the PQ_*CACHE options with a single PQ_CACHESIZE option that you
set equal to the number of kilobytes in your cache. The old options are
still supported for backwards compatibility.

Submitted by: Kelly Yancey <kbyanc@posi.net>


# 8b03c8ed 29-May-2000 Matthew Dillon <dillon@FreeBSD.org>

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# c4473420 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 4f79d873 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# be72f788 30-Oct-1999 Alan Cox <alc@FreeBSD.org>

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


# 90ecac61 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>

Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.

Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().

add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 38c808ed 17-Aug-1999 Brian Feldman <green@FreeBSD.org>

Unbreak the nfs KLD_MODULE. It needs a bit more of vm_page.h than was
exported (notably vm_page_undirty()). Also, let vm_page_dirty() work
in a KLD.


# 2c28a105 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


# 175d1f69 14-Aug-1999 Alan Cox <alc@FreeBSD.org>

contigmalloc1 (currently) depends on PQ_FREE and PQ_CACHE not being 0
to tell a valid "struct vm_page" from an invalid one in the vm_page_array.
This isn't a very robust method.


# 3739d7da 14-Aug-1999 Matt Jacob <mjacob@FreeBSD.org>

Add back in old definitions if we're compiling for alpha.


# 514bfcc4 14-Aug-1999 Alan Cox <alc@FreeBSD.org>

Don't create a "struct vpgqueues" for PQ_NONE.


# 1aefb1d9 12-Aug-1999 Alan Cox <alc@FreeBSD.org>

Make the default page coloring parameters match a (non-Xeon) Pentium II/III.

This setting is also acceptable for Celerons and Pentium Pros
with less than 1MB L2 caches.

Note: PQ_L2_SIZE is a misnomer. The correct number of colors is
a function of the cache's degree of associativity as well as its size.

Submitted by: bde and alc


# 5d2aec89 31-Jul-1999 Alan Cox <alc@FreeBSD.org>

Change the type of vpgqueues::lcnt from "int *" to "int". The indirection
served no purpose.


# 3b213483 22-Jul-1999 Alan Cox <alc@FreeBSD.org>

Reduce the number of "magic constants" used for page coloring
by one: PQ_PRIME2 and PQ_PRIME3 are used to accomplish the same
thing at different places in the kernel. Drop PQ_PRIME3.


# 9c89c228 22-Jun-1999 Alan Cox <alc@FreeBSD.org>

Remove (1) "extern" declarations for variables that were previously
made "static" and (2) initialized but unused variables.


# 6ea5bd80 19-Jun-1999 Alan Cox <alc@FreeBSD.org>

Remove some unused function and variable declarations.


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 8d17e694 05-Apr-1999 Julian Elischer <julian@FreeBSD.org>

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


# 811c2e1a 14-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Fix breakage in last commit
Submitted by: Brian Feldman <green@unixhelp.org>


# 0237469f 14-Mar-1999 Julian Elischer <julian@FreeBSD.org>

A bit of a hack, but allows the vn device to be a module again.

Submitted by: Matt Dillon <dillon@freebsd.org>


# efcae3d3 14-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Minor reorganization of vm_page_alloc(). No functional changes have
been made but the code has been reorganized and documented to make
it more readable, reduce the size of the code, and optimize the branch
path caching capabilities that most modern processors have.


# faa273d5 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other then the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guarenteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.


# a0e7b3e5 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove L1 cache coloring optimization ( leave L2 cache coloring opt ).

Rewrite vm_page_list_find() and vm_page_select_free() - make inline out
of nominal case.


# 161a8f65 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Add vm_page_dirty() inline with PQ_CACHE sanity check


# a7039a1d 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Add invariants to vm_page_busy() and vm_page_wakeup() to check for
PG_BUSY stupidity.


# 41dbefba 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

The TAILQ hashq has been turned into a singly-linked=list link,
reducing the size of vm_page_t.

SWAPBLK_NONE and SWAPBLK_MASK are defined here. These actually are
more generalized then their names imply, but their placement is somewhat
of a legacy issue from a prior test version of this code that put
the swapblk in the vm_page_t structure. That test code was eventually
thrown away. The legacy remains.

Added vm_page_flash() inline. Similar to vm_page_wakeup() except that
it does not clear PG_BUSY ( one assumes that PG_BUSY is already clear ).
Used by a number of routines to wakeup waiters.

Collapsed some of the code in inline calls to make other inline calls.
GCC will optimize this well and it reduces duplication.

vm_page_free() and vm_page_free_zero() inlines added to convert to
the proper vm_page_free_toq() call.

vm_page_sleep_busy() inline added, replacing vm_page_sleep() ( which has
been removed ). This implements a much more optimizable page-waiting
function.


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 73007561 28-Oct-1998 David Greenman <dg@FreeBSD.org>

Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.


# 300ee824 21-Oct-1998 David Greenman <dg@FreeBSD.org>

Nuked PG_TABLED flag. Replaced with m->object != NULL.


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 85e7f549 01-Sep-1998 Garrett Wollman <wollman@FreeBSD.org>

Separate wakeup conditions for page I/O count (pg_busy) and lock (PG_BUSY).
This is not sa completely solution to the deadlock, but the additional wakeups
have helped in my observation.

Suggested by: John Dyson


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 9e802365 22-Aug-1998 Stephen McKay <mckay@FreeBSD.org>

Correct/clarify some comments.


# 20f71813 30-Jun-1998 John-Mark Gurney <jmg@FreeBSD.org>

document some VM paging options for cache sizes:
PQ_NOOPT no coloring
PQ_LARGECACHE used for 512k/16k cache
PQ_HUGECACHE used for 1024k/16k cache


# be160d60 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# b9cefc08 23-May-1998 John Dyson <dyson@FreeBSD.org>

Support a 16K first level cache for 512K 2nd level. Also, add support
for 1MB 2nd level cache.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 853a7bc8 06-Oct-1996 John Dyson <dyson@FreeBSD.org>

Make the default cache size optim to be 256K, the old default was
64K. The change has essentially neutral effect on those machines with
little or no cache, and has a positive effect on "normal" machines
with 256K or more cache.


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 38efa82b 25-Jun-1996 John Dyson <dyson@FreeBSD.org>

This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.

Added some sysctls that help with VM system tuning.

New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.

vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.

Please give me feedback on the performance results
associated with these.

2) Enable/disable swapping.
vm.swapping_enabled = 1, default.

vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.

The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".

Each of these can be changed "on the fly."


# 886d3e11 08-Jun-1996 John Dyson <dyson@FreeBSD.org>

Adjust the threshold for blocking on movement of pages from the cache
queue in vm_fault.

Move the PG_BUSY in vm_fault to the correct place.

Remove redundant/unnecessary code in pmap.c.

Properly block on rundown of page table pages, if they are busy.

I think that the VM system is in pretty good shape now, and the following
individuals (among others, in no particular order) have helped with this
recent bunch of bugs, thanks! If I left anyone out, I apologize!

Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard,
Marc Fournier.


# 6b6f0008 04-Jun-1996 John Dyson <dyson@FreeBSD.org>

Keep page-table pages from ever being sensed as dirty. This should fix
some problems with the page-table page management code, since it can't
deal with the notion of page-table pages being paged out or in transit.
Also, clean up some stylistic issues per some suggestions from
Stephen McKay.


# 7f5fe93f 17-May-1996 John Dyson <dyson@FreeBSD.org>

One more file missing from the mega-commit. This inlines some very
simple routines in vm_page.c, so that an unnecessary subroutine call
is removed.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# d559b369 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

Remove of now unused PG_COPYONWRITE.


# 10ad4d48 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Added prototype for new routine "vm_page_set_validclean" and initial
declarations for the prezeroed pages mechanism.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 7666fb47 23-Apr-1995 Bruce Evans <bde@FreeBSD.org>

inline -> __inline.

Headers should always use `__inline' for inline functions to avoid
syntax errors when modules that don't even use the offending functions
are compiled with `gcc -ansi'.


# 8953151f 26-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed some obsolete flags.

Submitted by: John Dyson


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 0c1dacbc 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Moved ACT_MAX, ACT_ADVANCE, and ACT_DECLINE to vm_page.h.


# d2fc5315 13-Feb-1995 Poul-Henning Kamp <phk@FreeBSD.org>

YF fix.


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to acheive better paging performance and to
accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# a776a317 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Kill VM_PAGE_INIT macro as it is only used once and makes the code more
difficult to understand. Got rid of unused vm_page flags.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 4f21005b 14-Nov-1994 Bruce Evans <bde@FreeBSD.org>

pmap.h:
Disable the bogus declaration of pmap_bootstrap(). Since its arg list
is machine-dependent, it must be declared in a machine-dependent header.

vm_page.h:
Change `inline' to `__inline' and old-style function parameter lists for
inlined functions to new-style.

`inline' and old-style function parameter lists should never be used in
system headers, even in very machine-dependent ones, because they cause
warnings from gcc -Wreally-all.


# 091b0456 20-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).

This involves fixing a few things I broke last time.


# 27de4e40 17-Oct-1994 David Greenman <dg@FreeBSD.org>

Put sanity check for negative hold count into #ifdef DIAGNOSTIC so that
it doesn't consume an extra 3k of kernel text because of gcc's bogus
inlining code.


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# d3c2cf7a 27-Sep-1994 David Greenman <dg@FreeBSD.org>

1) New "vm_page_alloc_contig" routine by me.
2) Created a new vm_page flag "PG_FREE" to help track free pages.
3) Use PG_FREE flag to detect inconsistencies in a few places.


# a647a309 06-Sep-1994 David Greenman <dg@FreeBSD.org>

Simple changes to paging algorithms...but boy do they make a difference.
FreeBSD's paging performance has never been better. Wow.

Submitted by: John Dyson


# bbc0ec52 03-Aug-1994 David Greenman <dg@FreeBSD.org>

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources