History log of /freebsd-current/sys/vm/vm_swapout.c
# 7a79d066 09-Apr-2024 Bojan Novković <bnovkov@FreeBSD.org>

vm: improve kstack_object pindex calculation to avoid pindex holes

This commit replaces the linear transformation of kernel virtual
addresses to kstack_object pindex values with a non-linear
scheme that circumvents physical memory fragmentation caused by
kernel stack guard pages. The new mapping scheme is used to
effectively "skip" guard pages and assign pindices for
non-guard pages in a contiguous fashion.

The new allocation scheme requires that all default-sized kstack KVAs
come from a separate, specially aligned region of the KVA space.
For this to work, this commit introduces a dedicated per-domain
kstack KVA arena used to allocate kernel stacks of default size.
The behaviour on 32-bit platforms remains unchanged due to a
significantly smaller KVA space.

Aside from fulfilling the requirements imposed by the new scheme, a
separate kstack KVA arena facilitates superpage promotion in the rest
of the kernel and causes most kstacks to have guard pages at both ends.

Reviewed by: alc, kib, markj
Tested by: markj
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D38852
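
As a rough illustration of the skip-the-guard-pages idea (a
hypothetical helper, not the committed code; it assumes each stack
slot in the specially aligned arena is kpages stack pages plus one
trailing guard page):

    static vm_pindex_t
    kstack_pindex_sketch(vm_offset_t arena_base, vm_offset_t va, int kpages)
    {
            vm_offset_t off = va - arena_base;
            vm_offset_t slotsz = ptoa(kpages + 1);  /* stack + guard page */

            /* Guard pages consume no pindices, so the ranges assigned
             * to consecutive stacks are contiguous. */
            return (off / slotsz * kpages + atop(off % slotsz));
    }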


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree via automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# b8ebd99a 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm: Use __diagused for variables only used in KASSERT().
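
The pattern looks like this (an illustrative fragment; some_operation()
is a hypothetical call):

    int error __diagused;   /* only referenced when KASSERT is compiled in */

    error = some_operation(obj);
    KASSERT(error == 0, ("some_operation failed: %d", error));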


# 0f559a9f 17-Oct-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make vmdaemon timeout configurable

Make vmdaemon timeout configurable, so that one can adjust
how often it runs.

Here's a trick: set this to 1, then run 'limits -m 0 sh',
then run whatever you want with 'ktrace -it XXX', and observe
how the working set changes over time.

Reviewed by: kib
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D22038
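
A hedged sketch of such a knob (the sysctl name, default, and
description are illustrative, not necessarily what the commit added):

    static int vmdaemon_timeout = 0;
    SYSCTL_INT(_vm, OID_AUTO, vmdaemon_timeout, CTLFLAG_RWTUN,
        &vmdaemon_timeout, 0,
        "Time between vm_daemon runs, in seconds");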


# f13fa9df 26-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Use a single VM object for kernel stacks.

Previously we allocated a separate VM object for each kernel stack.
However, fully constructed kernel stacks are cached by UMA, so there is
no harm in using a single global object for all stacks. This reduces
memory consumption and makes it easier to define a memory allocation
policy for kernel stack pages, with the aim of reducing physical memory
fragmentation.

Add a global kstack_object, and use the stack KVA address to index into
the object like we do with kernel_object.

Reviewed by: kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24473
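
The linear indexing amounts to deriving the pindex from the stack KVA,
e.g. (an illustrative one-liner):

    /* Sketch: index kstack_object by KVA, as done for kernel_object. */
    pindex = atop(ks - VM_MIN_KERNEL_ADDRESS);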


# c99d0c58 28-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Add a blocking counter KPI.

refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter. However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by: jeff, kib, mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23723
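
Illustrative shape of the new KPI for a paging-in-progress count (a
sketch; sys/sys/blockcount.h is authoritative):

    blockcount_t pip;

    blockcount_init(&pip);
    blockcount_acquire(&pip, 1);    /* a paging operation starts */
    /* ... perform the I/O ... */
    blockcount_release(&pip, 1);    /* done; may wake up waiters */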


# e9ceb9dd 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't release xbusy on kmem pages. After lockless page lookup we will not
be able to guarantee that they can be reacquired without blocking.

Reviewed by: kib
Discussed with: markj
Differential Revision: https://reviews.freebsd.org/D23506


# d6e13f3b 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't hold the object lock while calling getpages.

The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here, add some missing
paging-in-progress calls and an assert. The object handle is now protected
explicitly with pip.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# 3c01c56b 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Don't update per-page activation counts in the swapout code.

This avoids duplicating the work of the page daemon's active queue scan.
Moreover, this duplication was inconsistent:
- PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced()
returned 0, but the page daemon always counts PGA_REFERENCED towards
the activation count.
- The swapout daemon always activates a referenced page, but the page
daemon only does so when the containing object is mapped at least
once.

The main purpose of swapout_deactivate_pages() is to shrink the number
of pages mapped into a given pmap. To do this without unmapping active
pages, use the non-destructive pmap_is_referenced() instead of the
destructive pmap_ts_referenced() and deactivate pages accordingly.
This simplifies some future changes to the locking protocol for page
queue state.

Reviewed by: kib
Discussed with: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22674
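
The resulting per-page policy, roughly (a simplified sketch, not the
committed code):

    /* Non-destructive check: reference bits are left intact. */
    if (!pmap_is_referenced(m) && vm_page_active(m))
            vm_page_deactivate(m);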


# 61a74c5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 1/4

Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.

Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626


# 9be9ea42 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Add a helper function to the swapout daemon's deactivation code.

vm_swapout_object_deactivate_pages() is renamed to
vm_swapout_object_deactivate(), and the loop body is moved into the new
vm_swapout_object_deactivate_page(). This makes the code a bit easier
to follow and is in preparation for some functional changes.

Reviewed by: jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22651


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
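
The intended usage pattern is a load/compare-and-set loop on a stack
snapshot, roughly (a sketch using the accessor names from vm_page.h;
treat the details as illustrative):

    vm_page_astate_t old, new;

    old = vm_page_astate_load(m);
    do {
            new = old;
            new.flags |= PGA_REFERENCED;    /* example state change */
    } while (!vm_page_astate_fcmpset(m, &old, new));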


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trivial to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
If the implementation ever changes from using a chain of next pointers,
changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
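
Its definition at the time simply wrapped the next-pointer chain, so
callers are insulated from the representation (a reconstructed sketch;
deactivate_entry() is a hypothetical per-entry operation):

    #define VM_MAP_ENTRY_FOREACH(it, map)           \
            for ((it) = (map)->header.next;         \
                (it) != &(map)->header;             \
                (it) = (it)->next)

    VM_MAP_ENTRY_FOREACH(entry, map)
            deactivate_entry(map, entry);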


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# 11b57401 12-Sep-2019 Hans Petter Selasky <hselasky@FreeBSD.org>

Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision: https://reviews.freebsd.org/D21620
Reviewed by: kib@, markj@
MFC after: 1 week
Sponsored by: Mellanox Technologies
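
The fix pattern, in brief (an illustrative fragment):

    /* Wrong: waiter flag bits may be set in the raw counter word. */
    count = object->ref_count;

    /* Right: mask the flag bits off to get the actual count. */
    count = REFCOUNT_COUNT(object->ref_count);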


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchronization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
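
The changed KPI shapes, in brief (an illustrative fragment, not
complete code):

    /* Requires the object lock or the busy lock to be held. */
    vm_page_wire(m);

    /* No return value; may free the page when the last reference is
     * released. */
    vm_page_unwire(m, PQ_INACTIVE);

    /* In pmap_extract_and_hold(): fails if the page is concurrently
     * being unmapped, triggering a fallback to the fault handler. */
    if (!vm_page_wire_mapped(m))
            return (NULL);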


# 386eba08 23-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Make vm_pqbatch_submit_page() externally visible.

It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21369


# 930b1952 21-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Don't requeue active pages in vm_swapout_object_deactivate_pages().

As of r332974 the page daemon does not requeue pages during a scan
of the active queue, so there is not much value in doing so here
either.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21343


# 0b26119b 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Cache kernel stacks in UMA. This gives us NUMA support, better concurrency,
and more statistics.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20931


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
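
For callers the visible change is roughly this (a sketch; buffer size
and error handling elided):

    vm_page_t ma[16];
    int n;

    /* Pages now come back wired rather than held. */
    n = vm_fault_quick_hold_pages(map, addr, len, VM_PROT_READ, ma,
        nitems(ma));
    if (n > 0) {
            /* ... use the pages ... */
            vm_page_unhold_pages(ma, n);    /* releases the wirings */
    }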


# f82dd310 20-Nov-2018 Ben Widawsky <bwidawsk@FreeBSD.org>

linuxkpi: Use pageproc instead of vmproc

According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages. vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary. It's a
somewhat legacy component of the system and isn't required. You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which vm_swapout_dummy.c is compiled
instead of vm_swapout.c.

Based on this, we want pageproc to emulate kswapd, not vmproc.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D18061


# c3f4f28c 20-Nov-2018 Ben Widawsky <bwidawsk@FreeBSD.org>

linuxkpi: Add some basic swap functions

These are used by kms-drm to determine various heuristics related to
memory conditions.

The number of free swap pages is just a variable, and it could be made
much cheaper to read by either adding a new getter or simply extern'ing
swap_total. However, this patch opts to use the more expensive,
existing interface, since this isn't an operation in a
high-performance path.

This allows us to remove some more GPL linuxkpi code and do the
following in kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h

Reviewed by: mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by: emaste (mentor)
Differential Revision: https://reviews.freebsd.org/D18052


# 2a843ae7 22-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Avoid a redundancy in a comment updated by r339601.

Reported by: alc
X-MFC with: r339601


# b0058196 22-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Swap in processes unless there's a global memory shortage.

On NUMA systems, we would not swap in processes unless all domains
had some free pages. This is too conservative in general. Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.

Reported and tested by: pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17304


# c1344d2b 13-Aug-2018 Konstantin Belousov <kib@FreeBSD.org>

Prevent some parallel swap-ins, rate-limit swapper swap-ins.

If faultin() was called outside the swapper (from PHOLD()), do not allow
the swapper to initiate additional swap-ins. Swapper-initiated swap-ins
are serialized because they are synchronous and executed in the
context of thread0. With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage; usually they do not need to.

Rate-limit the swapper's swap-ins to one per MAXSLP / 2 seconds
interval, counting faultin() swap-ins.

Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16610
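
The rate limit reduces to a tick-based gate, roughly (an illustrative
sketch; the variable name is hypothetical):

    static int last_swapin_ticks;

    /* Allow a swapper-initiated swap-in at most once per MAXSLP / 2
     * seconds; faultin() swap-ins also update the timestamp. */
    if ((u_int)(ticks - last_swapin_ticks) < MAXSLP / 2 * hz)
            return;
    last_swapin_ticks = ticks;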


# a70e9a13 04-Aug-2018 Konstantin Belousov <kib@FreeBSD.org>

Swap in WKILLED processes.

A swapped-out process that is WKILLED must be swapped in as soon as
possible, because such a process can be killed by the OOM handler and
its pages can only be freed once the process exits. To exit, the kernel
stack of the process must be mapped.

When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.

Add a counter of swapped-out processes to avoid needless iteration
over the allproc list when there is no work to do, reducing
allproc_lock hold time.

Reviewed by: alc, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16489


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
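
Conceptually, a deferred queue operation looks like this (a simplified
sketch; the real structures and helper names differ):

    struct pq_batch {               /* illustrative, not the real layout */
            vm_page_t pages[32];
            int       cnt;
    };

    /* Encode the request in the page's aflags under the page lock,
     * then defer the actual queue manipulation to a per-CPU batch. */
    vm_page_aflag_set(m, PGA_REQUEUE);
    critical_enter();
    bq = &pcpu_batch;               /* hypothetical per-CPU batch */
    bq->pages[bq->cnt++] = m;
    if (bq->cnt == nitems(bq->pages))
            pq_batch_flush(bq);     /* one lock acquisition, many pages */
    critical_exit();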


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 33937731 04-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Restructure swapout tests after vm map locking was removed.

Consolidate the regions covered by the process lock.
Combine similar condition tests into one, e.g., all process flags can
be tested with one logical operation.
Add a check for the in-exec state, since p_vmspace is dereferenced.
Remove labels and goto by explicitly tracking state.
Update comments.

Reviewed by: alc, markj (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13693


# 36ca312d 03-Jan-2018 Alan Cox <alc@FreeBSD.org>

Once we have decided to swap out a process, don't delay the laundering of
its per-thread kernel stack pages by making them pass through the inactive
queue first. Instead, immediately place them in the laundry so that they
might be cleaned and made available for reclamation sooner.

Reviewed by: kib, markj
MFC after: 1 week


# 9997c048 01-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Do not let vm_daemon run unbounded.

On a workload where a single anonymous object consumes almost all
memory of a large system, the swapout code iterates over the
corresponding object's page queue for a long time while owning the map
and object locks. This blocks the pagedaemon, which tries to lock the
object, and blocks other threads of the process in vm_fault() waiting
for the map lock.

Handle the issue by terminating the deactivation loop if it has
executed for too long and by yielding at the top level in vm_daemon.

Reported by: peterj, pho
Reviewed by: alc
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13671
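
The loop bound can be expressed with the scheduler's yield hints,
roughly (an illustrative sketch):

    TAILQ_FOREACH(m, &object->memq, listq) {
            if (should_yield())     /* ran too long; drop the locks */
                    break;
            /* ... deactivate the page ... */
    }

    /* ... and at the top level of vm_daemon: */
    maybe_yield();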


# 7000c588 31-Dec-2017 Alan Cox <alc@FreeBSD.org>

The variable "minslptime" is pointless and always has been, ever since its
introduction in r83366. (At that time, this code appeared in vm/vm_glue.c,
because vm/vm_swapout.c did not exist.) When the FOREACH_THREAD loop
completes, we know that the sleep time for every thread is above whichever
threshold is being applied.

Reviewed by: kib
X-MFC with: r327354


# b6eabc36 29-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not lock vm map in swapout_procs().

Neither swapout_procs() nor swapout() accesses the map. Since the
process' vmspace is referenced only to obtain the pointer to the
vm_map, the reference is not needed either.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13681


# e258b4a0 29-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13678


# 89c0e67d 28-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Clean up the comment.

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13671


# 0080a8fa 28-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

In vm_swapout_map_deactivate_pages(), it is enough to lock the map for read.

Reviewed by: alc, markj (as part of the larger patch)
Tested by: pho (again, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13671


# 796df753 30-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

SPDX: Consider code from Carnegie-Mellon University.

Interesting cases, most likely from CMU Mach sources.


# ac04195b 20-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Move swapout code into vm/vm_swapout.c.

There is no NO_SWAPPING #ifdef left in the code.

Requested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12663