History log of /freebsd-current/sys/vm/vm_phys.c
Revision Date Author Comments
# 543d55d7 04-Jun-2024 Doug Moore <dougm@FreeBSD.org>

vm_phys: use ilog2(x) instead of fls(x)-1

One of these changes saves two instructions on an amd64
GENERIC-NODEBUG build. The rest are entirely cosmetic, because the
compiler can deduce that x is nonzero, and avoid the needless test.

Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45331


# e3537f92 03-Jun-2024 Doug Moore <dougm@FreeBSD.org>

Revert "subr_pctrie: use ilog2(x) instead of fls(x)-1"

This reverts commit 574ef650695088d56ea12df7da76155370286f9f.


# 574ef650 02-Jun-2024 Doug Moore <dougm@FreeBSD.org>

subr_pctrie: use ilog2(x) instead of fls(x)-1

In three instances where fls(x)-1 is used, the compiler does not know
that x is nonzero and so adds needless zero checks. Using ilog(x)
instead saves, in each instance, about 4 instructions, including a
conditional, and 16 or so bytes, on an amd64 build.

Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45330


# cb20a74c 03-Apr-2024 Stephen J. Kiernan <stevek@FreeBSD.org>

vm: add macro to mark arguments used when NUMA is defined

This fixes compiler warnings when -Wunused-arguments is enabled and
not quieted.

Reviewed by: kib, markj
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44623


# 6dd15b7a 20-Dec-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys; fix uncalled free_contig

Function vm_phys_free_contig does not always free memory properly when
the npages parameter is less than max block size. Change it so that it does.

Note that this function is not currently invoked, and this error was
not triggered in earlier versions of the code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42891


# 2a4897bd 15-Nov-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys: fix freelist_contig

vm_phys_find_freelist_contig is called to search a list of max-sized
free page blocks and find one that, when joined with adjacent blocks
in memory, can satisfy a request for a memory allocation bigger than
any single max-sized free page block. In commit
fa8a6585c7522b7de6d29802967bd5eba2f2dcf1, I defined this function in
order to offer two improvements: 1) reduce the worst-case search time,
and 2) allow solutions that include less-than max-sized free page
blocks at the front or back of the giant allocation. However, it turns
out that this change introduced an error, reported in In Bug
274592. That error concerns failing to check segment boundaries. This
change fixes an error in vm_phys_find_freelist_config that resolves
that bug. It also abandons improvement 2), because the value of that
improvement is small and because preserving it would require more
testing than I am able to do.

PR: 274592
Reported by: shafaisal.us@gmail.com
Reviewed by: alc, markj
Tested by: shafaisal.us@gmail.com
Fixes: fa8a6585c752 vm_phys: avoid waste in multipage allocation
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D42509


# c415cfc8 12-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

vm_phys: Add corresponding sysctl knob for loader tunable

The loader tunable 'vm.numa.disabled' does not have corresponding sysctl
MIB entry. Add it so that it can be retrieved, and `sysctl -T` will also
report it correctly.

Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42138


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# e77f4e7f 04-Aug-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys: tune vm_phys_enqueue_contig loop

Rewrite the final loop in vm_phys_enqueue_contig as a new function,
vm_phys_enq_beg, to reduce amd64 code size.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D41289


# ccdb2827 04-Aug-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys_enq_range: no alignment assert for npages==0

Do not assume that when vm_phys_enq_range is passed npages==0 that the
vm_page argument is valid in any way, much less that it has a
page-aligned address. Just don't look at it. Assert nothing about it.

Reported by: karels
Differential Revision: https://reviews.freebsd.org/D41317


# c9b06fa5 03-Aug-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys_enqueue_contig: handle npages==0

By letting vm_phys_enqueue_contig handle the case when npages == 0,
the callers can stop checking it, and the compiler can stop
zero-checking with every call to ffs(). Letting vm_phys_enqueue_contig
call vm_phys_enqueue_contig for part of its work also saves a few
bytes.

The amd64 object code shrinks by 128 bytes.

Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D41154


# b7370efa 02-Aug-2023 Doug Moore <dougm@FreeBSD.org>

Revert "vm_phys_enqueue_contig: handle npages==0"

This reverts commit 1a7fcf6d51eb67ee3e05fdbb806f7e68f9f53c9c.

Peter Holm reported a problem, so I'm reverting now and looking for
the problem later.


# 1a7fcf6d 01-Aug-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys_enqueue_contig: handle npages==0

By letting vm_phys_enqueue_contig handle the case when npages == 0,
the callers can stop checking it, and the compiler can stop
zero-checking with every call to ffs(). Letting vm_phys_enqueue_contig
call vm_phys_enqueue_contig for part of its work also saves a few
bytes.

The amd64 object code shrinks by 80 bytes.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D41154


# 58d42717 16-Jun-2023 Alan Cox <alc@FreeBSD.org>

vm_phys: Fix typo in 9e8174289236


# 9e817428 16-Jun-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys: add binary segment search

Replace several sequential searches for a segment that contains a
phyiscal address with a call to a function that does it by binary
search. In vm_page_reclaim_contig_domain_ext, find the first segment
to reclaim from, and reclaim from each subsequent appropriate segment.
Eliminate vm_phys_scan_contig.

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D40058


# 6062d9fa 05-Jun-2023 Mark Johnston <markj@FreeBSD.org>

vm_phys: Change the return type of vm_phys_unfree_page() to bool

This is in keeping with the trend of removing uses of boolean_t, and the
sole caller was implicitly converting it to a "bool".

No functional change intended.

Reviewed by: dougm, alc, imp, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40401


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# fa8a6585 26-Apr-2022 Doug Moore <dougm@FreeBSD.org>

vm_phys: avoid waste in multipage allocation

In vm_phys_alloc_contig, for an allocation bigger than the size of any
buddy queue free block, avoid examining any maximum-size free block
more than twice, by only starting to consider a sequence of adjacent
max-blocks starting at a max-block that does not follow another
max-block. If that first max-block follows adjacent blocks of smaller
size, and if together they provide enough memory to reduce by one the
number of max-blocks required for this allocation, use them as part of
this allocation.

Reviewed by: markj
Tested by: pho
Discussed with: alc
Differential Revision: https://reviews.freebsd.org/D34815


# 52526922 18-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm_phys_init: Quiet unused but set warnings about npages.

npages is used in two optional cases:

- to conditionally create a separate DMA32 free list

- to index vm_page_array for VM_PHYSSEG_SPARSE

Add in more #ifdef's around npages statements.

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D34887


# 2e7838ae 08-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm_phys_early_alloc: mem_index is only used under #ifdef NUMA.

Possibly mem_index should just reuse biggestone since this loop is
already reusing biggestsize.


# 557dc337 31-Mar-2022 Doug Moore <dougm@FreeBSD.org>

vm_phys: check small blocks to finish allocation

In vm_phys_alloc_queues_contig, in the case that a sequence of
max-order blocks are sought to fulfill an allocation, a sequence is
ruled out if it does not have enough max-order blocks to satisfy the
allocation. However, there may be smaller blocks of free memory that
follow the last max-order block in the sequence, and they may be big
enough to complete the allocation request, so check for that
possibility before giving up on that block sequence.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D34724


# 342056fa 31-Mar-2022 Doug Moore <dougm@FreeBSD.org>

vm_phys: alloc pages without duplicating searches.

In the search for contiguous pages, as each page segment is examined,
check to see if the free list set for the next page segment differs
from the set for the current segment, and avoid a pointless search if
they do not differ.

Discussed with: alc
Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D33947


# 0ce7909c 18-Jan-2022 Doug Moore <dougm@FreeBSD.org>

vm_phys: add essential segment bounds check

A lower-bound segment check is necessary in vm_phys_alloc_seg_contig.
Add one.

Reported by: jenkins
Reviewed by: alc
Fixes: da92ecbc0d8f vm_phys: fix seg->end test in alloc_seg_contig
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33945


# da92ecbc 17-Jan-2022 Doug Moore <dougm@FreeBSD.org>

vm_phys: fix seg->end test in alloc_seg_contig

In vm_phys_alloc_seg_contig, in allocating multiple memory blocks for
a huge allocation, ensure that the end of the allocated range does not
exceed the upper segment limit.

Reorder a couple of checks to improve code layout.

Reviewed by: alc
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33870


# e6930b1c 30-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_phys: convert error back to warning

Move an assignment back to where it was before, to turn the
defined-but-not-used error back into a set-but-not-used warning.

Fixes: 01e115ab83a4 vm_phys: #include vm_extern


# 01e115ab 30-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_phys: #include vm_extern

Arm64 and powerpc don't include vm_extern.h indirectly in vm_phys.c, which
means that for the sake of those architectures, it must be included explicitly.

Also, fix a set-unused warning that jenkins also found.

Reported by: Jenkins
Fixes: c606ab59e7f9 vm_extern: use standard address checkers everywhere


# c606ab59 30-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_extern: use standard address checkers everywhere

Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33685


# 8119cdd3 29-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_phys: hide vm_phys_set_pool

It is only called in the file that defines it, so make it static and
remove the declaration from the header.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33688


# 31991a5a 29-Sep-2021 Mitchell Horne <mhorne@FreeBSD.org>

minidump: De-duplicate is_dumpable()

The function is identical in each minidump implementation, so move it to
vm_phys.c. The only slight exception is powerpc where the function was
public, for use in moea64_scan_pmap().

Reviewed by: kib, markj, imp (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31884


# 431fb8ab 18-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vm_phys: Try to clean up NUMA KPIs

It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207


# 114484b7 23-Sep-2020 Mark Johnston <markj@FreeBSD.org>

Flag vm_reserv and vm_phys sysctls as MPSAFE.

Nothing in these subsystems relies on Giant.

MFC after: 1 week


# 81302f1d 28-May-2020 Mark Johnston <markj@FreeBSD.org>

Fix boot on systems where NUMA domain 0 is unpopulated.

- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames.

Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# b649c2ac 22-Dec-2019 Doug Moore <dougm@FreeBSD.org>

Fix typo using RB_INITIALIZER.

The macro RB_INITIALIZER ignores its argument, but is documented to
require "&head" as argument to initialize "head". So using
"_vm_phys_fictitious_tree" as the argument to initialize
"vm_phys_fictitious_tree" is an inconsequential error, corrected here.

Discussed with: alc


# 3921068f 18-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary debugging from r351181 that caused powerpc build to fail.

Tested by: make universe TARGETS=powerpc


# be3f5f29 18-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

vm_phys_avail_find is only used on NUMA kernels. Fix a build error.


# b7565d44 18-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Encapsulate phys_avail manipulation in a set of simple routines. Add a
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures. Code partially submitted by gallatin.

Reviewed by: gallatin, kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21251


# 21943937 15-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Move phys_avail definition into MI code. It is consumed in the MI layer and
doing so adds more flexibility with less redundant code.

Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250


# c1685086 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.

Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929


# b8590dae 31-May-2019 Doug Moore <dougm@FreeBSD.org>

The function vm_phys_free_contig invokes vm_phys_free_pages for every
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time. The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.

The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.

One beneficiary of this change is in breaking reservations. For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901


# 75d6d576 27-Feb-2019 Mateusz Guzik <mjg@FreeBSD.org>

vm: remove seq.h inclusion made obsolete by NUMA rewrite

Sponsored by: The FreeBSD Foundation


# f2a496d6 18-Jan-2019 Konstantin Belousov <kib@FreeBSD.org>

MI VM: Make it possible to set size of superpage at boot instead of compile time.

In order to allow single kernel to use PAE pagetables on i386 if
hardware supports it, and fall back to classic two-level paging
structures if not, superpage code should be able to adopt to either 2M
or 4M superpages size. There I make MI VM structures large enough to
track the biggest possible superpage, by allowing architecture to
define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX constants.
Corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER symbols can be
defined as runtime values and must be less than the _MAX constants.
If architecture does not define _MAXs, it is assumed that _MAX ==
normal constant.

Reviewed by: markj
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18853


# 87ab1a10 23-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Initialize static domainsets regardless of whether an SRAT is present.

Reported by: yuripv
X-MFC with: r339452
Sponsored by: The FreeBSD Foundation


# b61f3142 22-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Make it possible to disable NUMA support with a tunable.

This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration. With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR: 231460
Reviewed by: alc, gallatin, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17439


# 662e7fa8 20-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Create some global domainsets and refactor NUMA registration.

Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416


# 463406ac 24-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Add more NUMA-specific low memory predicates.

Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278


# 72aebdd7 02-Sep-2018 Alan Cox <alc@FreeBSD.org>

Recent changes have created, for the first time, physical memory segments
that can be coalesced. To be clear, fragmentation of phys_avail[] is not
the cause. This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time. Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.

Reviewed by: kib, markj
Approved by: re (rgrimes)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16976


# 67d33338 27-Jul-2018 Warner Losh <imp@FreeBSD.org>

Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM.

There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).

Differential Review: https://reviews.freebsd.org/D16290


# 370a338a 04-Jul-2018 Alan Cox <alc@FreeBSD.org>

Allow callers to vm_phys_split_pages() to specify whether insertion should
occur at the head or the tail of the page queues.


# 7493904e 02-Jul-2018 Alan Cox <alc@FreeBSD.org>

Introduce vm_phys_enq_range(), and call it in vm_phys_alloc_npages()
and vm_phys_alloc_seg_contig() instead of vm_phys_free_contig(). In
short, vm_phys_enq_range() is simpler and faster than the more general
vm_phys_free_contig(), and in the case of vm_phys_alloc_seg_contig(),
vm_phys_free_contig() was placing the excess physical pages at the
wrong end of the queues.

In collaboration with: Doug Moore <dougm@rice.edu>


# 9161b4de 28-Jun-2018 Alan Cox <alc@FreeBSD.org>

Three changes to vm_phys_alloc_seg_contig():

1. Optimize the order computation.

2. Update the pool for all of the chunks that are removed from the free
page lists, and not just the first chunk.

3. Simplify the code for returning excess pages to the free page lists.

Reviewed by: Doug Moore <dougm@rice.edu>


# 32d81f21 28-Jun-2018 Alan Cox <alc@FreeBSD.org>

Reflow one of the comments describing vm_phys_alloc_npages().


# 89ea39a7 26-Jun-2018 Alan Cox <alc@FreeBSD.org>

Update the physical page selection strategy used by vm_page_import() so
that it does not cause rapid fragmentation of the free physical memory.

Reviewed by: jeff, markj (an earlier version)
Differential Revision: https://reviews.freebsd.org/D15976


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893


# c33e3a64 31-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Add a uma cache of free pages in the DEFAULT freepool. This gives us
per-cpu alloc and free of pages. The cache is filled with as few trips
to the phys allocator as possible by the use of a new
vm_phys_alloc_npages() function which allocates as many as N pages.

This code was originally by markj with the import function rewritten by
me.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14905


# cdfeced8 22-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Use read_mostly and alignment tags to eliminate or limit false sharing.

Reviewed by: markj (Part of D14707)
Sponsored by: Netflix, Dell/EMC Isilon


# 79e9552e 20-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Check for wrap-around in vm_phys_alloc_seg_contig().

It is possible to provide insane values for size in contigmalloc(9)
request, which usually not reaches the phys allocator due to failing
KVA allocation. But with the forthcoming 4/4 i386, where 32bit
architecture has almost 4G KVA, contigmalloc(1G) is not unreasonable
outright and KVA might be available sometimes.

Then, the calculation of pa_end could wrap around, depending on the
physical address, and the checks in vm_phys_alloc_seg_contig() would
pass while the iteration in the loop after the 'done' label goes out
of the vm_page_array bounds.

Fix it by detecting the wrap.

Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14767


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# b6715dab 13-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.

Sponsored by: Netflix, Dell/EMC Isilon
Discussed with: jhb


# 6f4acaf4 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Add support for NUMA domains to bus dma tags. This causes all memory
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag. Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.

Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13545


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 5be93778 04-Dec-2017 Andrew Turner <andrew@FreeBSD.org>

Print the correct value when freelist is out of range.

Security: :
Sponsored by: DARPA, AFRL


# 0db2102a 04-Dec-2017 Michael Zhilin <mizhka@FreeBSD.org>

[mips] [vm] restore translation of freelist to flind for page allocation

Commit r326346 moved domain iterators from physical layer to vm_page one,
but it also removed translation of freelist to flind for
vm_page_alloc_freelist() call. Before it expects VM_FREELIST_ parameter,
but after it expect freelist index.

On small WiFi boxes with few megabytes of RAM, there is only one freelist
VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file
sys/mips/include/vmparam.h). It results in freelist 1 with flind 0.

At first, this commit renames flind to freelist in vm_page_alloc_freelist
to avoid misunderstanding about input parameters. Then on physical layer it
restores translation for correct handling of freelist parameter.

Reported by: landonf
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D13351


# ef435ae7 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Move domain iterators into the page layer where domain selection should take
place. This makes the majority of the phys layer explicitly domain specific.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix & Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13014


# d2b677ce 27-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Avoid unnecessary lookups when initializing the vm_page array.

This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13270


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.


# b20bf182 26-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Move vm_phys_init_page() to vm_page.c.

Suggested by: kib
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13250


# 830cb6b2 26-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Remove unneeded initializations from vm_phys_init_page().

The page allocator always initializes the aflags and oflags fields.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13242


# f93f7cf1 07-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Speed up vm_page_array initialization.

We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.

Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.

Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12248


# a6b15641 02-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Ifdef out the unused vm_rr_selectdomain().

MFC after: 2 weeks
Sponsored by: DARPA, AFRL


# dbbaf04f 03-Sep-2016 Mark Johnston <markj@FreeBSD.org>

Remove support for idle page zeroing.

Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714


# fc85a6f0 13-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Initialize page busy lock state in vm_phys_add_page().

MFC after: 1 week


# d9c9c81c 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: use our roundup2/rounddown2() macros when param.h is available.

rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.


# 62d70a81 09-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Add more fine-grained kernel options for NUMA support.

VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782


# 477bffbe 15-Jan-2016 Alan Cox <alc@FreeBSD.org>

A fix to r292469: Iterate over the physical segments in descending rather
than ascending order in vm_phys_alloc_contig() so that, for example, a
sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply
of low physical memory resulting in a later contigmalloc(low=0, high=1MB)
failure.

Reported by: cy
Tested by: cy
Sponsored by: EMC / Isilon Storage Division


# c869e672 19-Dec-2015 Alan Cox <alc@FreeBSD.org>

Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
memory allocator's free page lists. This replaces the long-standing
approach of scanning the inactive and inactive queues, converting clean
pages into PG_CACHED pages and laundering dirty pages. In contrast, the
new mechanism does not use PG_CACHED pages nor does it trigger a large
number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
free pages in the physical memory allocator's free page lists that are
covered by the direct map. Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places. For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by: kib, markj
Glanced at: jhb
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4444


# 6520495a 11-Jul-2015 Adrian Chadd <adrian@FreeBSD.org>

Add an initial NUMA affinity/policy configuration for threads and processes.

This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.

Sponsored by: Norse Corp, Inc.; Dell


# 415d7cca 07-May-2015 Adrian Chadd <adrian@FreeBSD.org>

Add initial memory locality cost awareness to the VM, and include
a basic ACPI SLIT table parser.

For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.

* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.

Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)


# ed9dd64b 14-Mar-2015 Ian Lepore <ian@FreeBSD.org>

Revert r279932; this is going to be fixed in the sbuf code instead.

PR: 195668


# f3b9fcf2 12-Mar-2015 Ian Lepore <ian@FreeBSD.org>

Nullterminate strings returned via sysctl.

PR: 195668


# d866a563 30-Dec-2014 Alan Cox <alc@FreeBSD.org>

The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.

On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.

PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division


# 271f0f12 15-Nov-2014 Alan Cox <alc@FreeBSD.org>

Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings. To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with: Svatopluk Kraus
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division


# 5ebe728d 05-Aug-2014 Roger Pau Monné <royger@FreeBSD.org>

vm_phys: improve robustness of fictitious ranges

With the current implementation of managed fictitious ranges when
also using VM_PHYSSEG_DENSE, a user could try to register a
fictitious range that starts inside of vm_page_array, but then
overrruns it (because the end of the fictitious range is greater than
vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE
returning unallocated pages from past the end of vm_page_array. The
same could happen if a user tried to register a segment that starts
outside of vm_page_array but ends inside of it.

In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to
use a set of pages from vm_page_array, and allocate the rest.

Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc

vm/vm_phys.c:
- Allow registering/unregistering fictitious ranges that overrun
vm_page_array.


# 38d6b2dc 09-Jul-2014 Roger Pau Monné <royger@FreeBSD.org>

vm_phys: remove limitation on number of fictitious regions

The number of vm fictitious regions was limited to 8 by default, but
Xen will make heavy usage of those kind of regions in order to map
memory from foreign domains, so instead of increasing the default
number, change the implementation to use a red-black tree to track vm
fictitious ranges.

The public interface remains the same.

Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc
Approved by: gibbs

vm/vm_phys.c:
- Replace the vm fictitious static array with a red-black tree.
- Use a rwlock instead of a mutex, since now we also need to take the
lock in vm_phys_fictitious_to_vm_page, and it can be shared.


# a17937bd 29-Apr-2014 Konstantin Belousov <kib@FreeBSD.org>

For the VM_PHYSSEG_DENSE case, checking the requested range to fall
into the area backed by vm_page_array wrongly compared end with
vm_page_array_size. It should be adjusted by first_page index to be
correct.

Also, the corner and incorrect case of the requested range extending
after the end of the vm_page_array was incorrectly handled by
allocating the segment.

Fix the comparision for the end of range and return EINVAL if the end
extends beyond vm_page_array.

Discussed with: royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 000fb817 31-Dec-2013 Alan Cox <alc@FreeBSD.org>

Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held. Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags. Moreover,
the PG_FREE flag can be retired. Now that the reservation system no longer
uses it, its only uses are in a few assertions. Eliminating these
assertions is no real loss. Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by: EMC / Isilon Storage Division


# eb2f42fb 10-Oct-2013 Alan Cox <alc@FreeBSD.org>

Tidy up the output of "sysctl vm.phys_free".

Approved by: re (glebius)
Sponsored by: EMC / Isilon Storage Division


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 449c2e92 07-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# c0432fc3 06-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Fill in the description fields for M_FICT_PAGES.

Reviewed by: kib
MFC after: 3 days


# 6b5fbc12 03-Jul-2013 Neel Natu <neel@FreeBSD.org>

vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be
reset by pmap_page_init() right after being initialized in vm_page_initfake().

The statement above is with reference to the amd64 implementation of
pmap_page_init().

Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.

Reviewed by: kib
MFC after: 2 weeks


# 7e226537 13-May-2013 Attilio Rao <attilio@FreeBSD.org>

o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free pages queues really by domain and not rely on
definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
function that is called when calculating the allocation domain.
The RR counter is kept, currently, per-thread.
In the future it is expected that such function evolves in a real
policy decision referee, based on specific informations retrieved by
per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
It is responsibility for every architecture willing to support multiple
memory domains to correctly probe vm_ndomains along with mem_affinity
segments attributes. Those two values are supposed to remain always
consistent.
Please also note that vm_ndomains and td_dom_rr_idx are both int
because segments already store domains as int. Ideally u_int would
have much more sense. Probabilly this should be cleaned up in the
future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().

Sponsored by: EMC / Isilon storage division
Partly obtained from: jeff
Reviewed by: alc
Tested by: jeff


# d0b5855e 08-May-2013 Attilio Rao <attilio@FreeBSD.org>

Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of
MAXMEMDOM.
This unbreak builds.

Sponsored by: EMC / Isilon storage division
Reported by: adrian, jeli


# 941646f5 07-May-2013 Attilio Rao <attilio@FreeBSD.org>

Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.

Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc


# f5c4b077 03-May-2013 John Baldwin <jhb@FreeBSD.org>

Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist()
to allocate a page from a specific freelist. In the NUMA case it did not
properly map the public VM_FREELIST_* constants to the correct backing
freelists, nor did it try all NUMA domains for allocations from
VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread and each call to
vm_phys_alloc_freelist_pages() fetched the current domain to choose
which freelist to use. If a thread migrated domains during the loop
in vm_phys_alloc_pages() it could skip one of the freelists. If the
other freelists were out of memory then it is possible that
vm_phys_alloc_pages() would fail to allocate a page even though pages
were available resulting in a panic in vm_page_alloc().

Reviewed by: alc
MFC after: 1 week


# 174b5f38 14-Feb-2013 John Baldwin <jhb@FreeBSD.org>

Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel
config file.

Requested by: phk (ages ago)
MFC after: 1 month


# b6de32bd 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of range registration
interface, and fake pages could be put automatically into the hash,
were PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# d6e9b97b 19-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Bah, just revert my earlier change entirely. (Missed alc's request to do
this earlier.)

Requested by: alc


# 8407f696 19-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Alter the previous commit to use vm_size_t instead of vm_pindex_t.
vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t,
but a page index instead of a byte offset.


# df96bc97 14-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Pedantic nit: use vm_pindex_t instead of long for a count of pages.


# fbd80bd0 16-Nov-2011 Alan Cox <alc@FreeBSD.org>

Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.

Reviewed by: kib
Discussed with: jhb


# 5c1f2cc4 29-Oct-2011 Alan Cox <alc@FreeBSD.org>

Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at
eliminating duplicated code in the various pmap implementations.

Micro-optimize vm_phys_free_pages().

Introduce vm_phys_free_contig(). It is fast routine for freeing an
arbitrary number of physically contiguous pages. In particular, it
doesn't require the number of pages to be a power of two.

Use "u_long" instead of "unsigned long".

Bruce Evans (bde@) has convinced me that the "boundary" parameters
to kmem_alloc_contig(), vm_phys_alloc_contig(), and
vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not
"u_long". Make this change.


# 2d510660 22-Oct-2011 Attilio Rao <attilio@FreeBSD.org>

VN_NRESERVLEVEL is used in this file but opt_vm is not included
thus the stub switch won't be correctly handled.
Include opt_vm.h.

Submitted by: jeff
MFC after: 3 days


# 00f0e671 26-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# 44e46b9e 17-Jan-2011 Alan Cox <alc@FreeBSD.org>

Explicitly initialize the page's queue field to PQ_NONE instead of relying
on PQ_NONE being zero.

Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for
PQ_NONE.

Reviewed by: kib@


# d689bc00 30-Oct-2010 Alan Cox <alc@FreeBSD.org>

Correct some format strings used by sysctls.

MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 4e657159 16-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 404a593e 13-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# dd67e210 09-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# a3870a18 27-Jul-2010 John Baldwin <jhb@FreeBSD.org>

Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.

Reviewed by: alc


# 49ca10d4 21-Jul-2010 Jayachandran C. <jchandra@FreeBSD.org>

Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by: alc


# 3153e878 12-Jul-2009 Alan Cox <alc@FreeBSD.org>

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# e999111a 25-Jun-2009 Alan Cox <alc@FreeBSD.org>

This change is the next step in implementing the cache control functionality
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with: jhb


# ef327c3e 21-Jun-2009 Alan Cox <alc@FreeBSD.org>

Implement a mechanism within vm_phys_alloc_contig() to defer all necessary
calls to vdrop() until after the free page queues lock is released. This
eliminates repeatedly releasing and reacquiring the free page queues lock
each time the last cached page is reclaimed from a vnode-backed object.


# 6f0489c6 20-Jun-2009 Alan Cox <alc@FreeBSD.org>

Strive for greater consistency among the places that implement real,
fictious, and contiguous page allocation. Eliminate unnecessary
reinitialization of a page's fields.


# f06a3a36 18-Jun-2009 Andrew Thompson <thompsa@FreeBSD.org>

Track the kernel mapping of a physical page by a new entry in vm_page
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issue.

Submitted by: Mark Tinguely
Reviewed by: alc
Tested by: Grzegorz Bernacki, thompsa


# ead1d027 16-Jun-2009 Alan Cox <alc@FreeBSD.org>

Make the maintenance of a page's valid bits by contigmalloc() more like
kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 44aab2c3 06-Apr-2008 Alan Cox <alc@FreeBSD.org>

Introduce vm_reserv_reclaim_contig(). This function is used by
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.

Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation. This subroutine is also used by
vm_reserv_reclaim_contig().


# 2fbced65 04-Apr-2008 Alan Cox <alc@FreeBSD.org>

Eliminate an unnecessary test from vm_phys_unfree_page().


# 9742373a 20-Dec-2007 Alan Cox <alc@FreeBSD.org>

Update the comment describing vm_phys_unfree_page().


# e35395ce 20-Dec-2007 Alan Cox <alc@FreeBSD.org>

Modify vm_phys_unfree_page() so that it no longer requires the given
page to be in the free lists. Instead, it now returns TRUE if it
removed the page from the free lists and FALSE if the page was not
in the free lists.

This change is required to support superpage reservations. Specifically,
once reservations are introduced, a cached page can either be in the
free lists or a reservation.


# bc8794a1 19-Dec-2007 Alan Cox <alc@FreeBSD.org>

Correct one half of a loop continuation condition in vm_phys_unfree_page().
At present, this error is inconsequential; the other half of the loop
continuation condition is sufficient to achieve correct execution.


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 8941dc44 14-Jul-2007 Alan Cox <alc@FreeBSD.org>

Eliminate two unused functions: vm_phys_alloc_pages() and
vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to
vm_phys_alloc_pages() and vm_phys_free_pages_locked() to
vm_phys_free_pages(). Add comments regarding the need for the free page
queues lock to be held by callers to these functions. No functional
changes.

Approved by: re (hrs)


# 2f9f48d6 15-Jun-2007 Alan Cox <alc@FreeBSD.org>

Update a comment.


# 11752d88 09-Jun-2007 Alan Cox <alc@FreeBSD.org>

Add a new physical memory allocator. However, do not yet connect it
to the build.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re