History log of /freebsd-current/sys/vm/vm_phys.c
Revision	Date	Author	Comments
# 543d55d7	04-Jun-2024	Doug Moore <dougm@FreeBSD.org>	vm_phys: use ilog2(x) instead of fls(x)-1 One of these changes saves two instructions on an amd64 GENERIC-NODEBUG build. The rest are entirely cosmetic, because the compiler can deduce that x is nonzero, and avoid the needless test. Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D45331
# e3537f92	03-Jun-2024	Doug Moore <dougm@FreeBSD.org>	Revert "subr_pctrie: use ilog2(x) instead of fls(x)-1" This reverts commit 574ef650695088d56ea12df7da76155370286f9f.
# 574ef650	02-Jun-2024	Doug Moore <dougm@FreeBSD.org>	subr_pctrie: use ilog2(x) instead of fls(x)-1 In three instances where fls(x)-1 is used, the compiler does not know that x is nonzero and so adds needless zero checks. Using ilog2(x) instead saves, in each instance, about 4 instructions, including a conditional, and 16 or so bytes, on an amd64 build. Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D45330
# cb20a74c	03-Apr-2024	Stephen J. Kiernan <stevek@FreeBSD.org>	vm: add macro to mark arguments used when NUMA is defined This fixes compiler warnings when -Wunused-arguments is enabled and not quieted. Reviewed by: kib, markj Obtained from: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D44623
# 6dd15b7a	20-Dec-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys: fix uncalled free_contig Function vm_phys_free_contig does not always free memory properly when the npages parameter is less than max block size. Change it so that it does. Note that this function is not currently invoked, and this error was not triggered in earlier versions of the code. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D42891
# 2a4897bd	15-Nov-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys: fix freelist_contig vm_phys_find_freelist_contig is called to search a list of max-sized free page blocks and find one that, when joined with adjacent blocks in memory, can satisfy a request for a memory allocation bigger than any single max-sized free page block. In commit fa8a6585c7522b7de6d29802967bd5eba2f2dcf1, I defined this function in order to offer two improvements: 1) reduce the worst-case search time, and 2) allow solutions that include less-than max-sized free page blocks at the front or back of the giant allocation. However, it turns out that this change introduced an error, reported in Bug 274592. That error concerns failing to check segment boundaries. This change fixes an error in vm_phys_find_freelist_contig that resolves that bug. It also abandons improvement 2), because the value of that improvement is small and because preserving it would require more testing than I am able to do. PR: 274592 Reported by: shafaisal.us@gmail.com Reviewed by: alc, markj Tested by: shafaisal.us@gmail.com Fixes: fa8a6585c752 vm_phys: avoid waste in multipage allocation MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D42509
# c415cfc8	12-Oct-2023	Zhenlei Huang <zlei@FreeBSD.org>	vm_phys: Add corresponding sysctl knob for loader tunable The loader tunable 'vm.numa.disabled' does not have corresponding sysctl MIB entry. Add it so that it can be retrieved, and `sysctl -T` will also report it correctly. Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D42138
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# e77f4e7f	04-Aug-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys: tune vm_phys_enqueue_contig loop Rewrite the final loop in vm_phys_enqueue_contig as a new function, vm_phys_enq_beg, to reduce amd64 code size. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D41289
# ccdb2827	04-Aug-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys_enq_range: no alignment assert for npages==0 Do not assume that when vm_phys_enq_range is passed npages==0 that the vm_page argument is valid in any way, much less that it has a page-aligned address. Just don't look at it. Assert nothing about it. Reported by: karels Differential Revision: https://reviews.freebsd.org/D41317
# c9b06fa5	03-Aug-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys_enqueue_contig: handle npages==0 By letting vm_phys_enqueue_contig handle the case when npages == 0, the callers can stop checking it, and the compiler can stop zero-checking with every call to ffs(). Letting vm_phys_enqueue_contig call vm_phys_enqueue_contig for part of its work also saves a few bytes. The amd64 object code shrinks by 128 bytes. Reviewed by: kib (previous version) Tested by: pho Differential Revision: https://reviews.freebsd.org/D41154
# b7370efa	02-Aug-2023	Doug Moore <dougm@FreeBSD.org>	Revert "vm_phys_enqueue_contig: handle npages==0" This reverts commit 1a7fcf6d51eb67ee3e05fdbb806f7e68f9f53c9c. Peter Holm reported a problem, so I'm reverting now and looking for the problem later.
# 1a7fcf6d	01-Aug-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys_enqueue_contig: handle npages==0 By letting vm_phys_enqueue_contig handle the case when npages == 0, the callers can stop checking it, and the compiler can stop zero-checking with every call to ffs(). Letting vm_phys_enqueue_contig call vm_phys_enqueue_contig for part of its work also saves a few bytes. The amd64 object code shrinks by 80 bytes. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D41154
# 58d42717	16-Jun-2023	Alan Cox <alc@FreeBSD.org>	vm_phys: Fix typo in 9e8174289236
# 9e817428	16-Jun-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys: add binary segment search Replace several sequential searches for a segment that contains a physical address with a call to a function that does it by binary search. In vm_page_reclaim_contig_domain_ext, find the first segment to reclaim from, and reclaim from each subsequent appropriate segment. Eliminate vm_phys_scan_contig. Reviewed by: alc, markj Differential Revision: https://reviews.freebsd.org/D40058
# 6062d9fa	05-Jun-2023	Mark Johnston <markj@FreeBSD.org>	vm_phys: Change the return type of vm_phys_unfree_page() to bool This is in keeping with the trend of removing uses of boolean_t, and the sole caller was implicitly converting it to a "bool". No functional change intended. Reviewed by: dougm, alc, imp, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D40401
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# c84c5e00	18-Jul-2022	Mitchell Horne <mhorne@FreeBSD.org>	ddb: annotate some commands with DB_CMD_MEMSAFE This is not completely exhaustive, but covers a large majority of commands in the tree. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35583
# fa8a6585	26-Apr-2022	Doug Moore <dougm@FreeBSD.org>	vm_phys: avoid waste in multipage allocation In vm_phys_alloc_contig, for an allocation bigger than the size of any buddy queue free block, avoid examining any maximum-size free block more than twice, by only starting to consider a sequence of adjacent max-blocks starting at a max-block that does not follow another max-block. If that first max-block follows adjacent blocks of smaller size, and if together they provide enough memory to reduce by one the number of max-blocks required for this allocation, use them as part of this allocation. Reviewed by: markj Tested by: pho Discussed with: alc Differential Revision: https://reviews.freebsd.org/D34815
# 52526922	18-Apr-2022	John Baldwin <jhb@FreeBSD.org>	vm_phys_init: Quiet unused but set warnings about npages. npages is used in two optional cases: - to conditionally create a separate DMA32 free list - to index vm_page_array for VM_PHYSSEG_SPARSE Add in more #ifdef's around npages statements. Reviewed by: alc, markj Differential Revision: https://reviews.freebsd.org/D34887
# 2e7838ae	08-Apr-2022	John Baldwin <jhb@FreeBSD.org>	vm_phys_early_alloc: mem_index is only used under #ifdef NUMA. Possibly mem_index should just reuse biggestone since this loop is already reusing biggestsize.
# 557dc337	31-Mar-2022	Doug Moore <dougm@FreeBSD.org>	vm_phys: check small blocks to finish allocation In vm_phys_alloc_queues_contig, in the case that a sequence of max-order blocks are sought to fulfill an allocation, a sequence is ruled out if it does not have enough max-order blocks to satisfy the allocation. However, there may be smaller blocks of free memory that follow the last max-order block in the sequence, and they may be big enough to complete the allocation request, so check for that possibility before giving up on that block sequence. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D34724
# 342056fa	31-Mar-2022	Doug Moore <dougm@FreeBSD.org>	vm_phys: alloc pages without duplicating searches. In the search for contiguous pages, as each page segment is examined, check to see if the free list set for the next page segment differs from the set for the current segment, and avoid a pointless search if they do not differ. Discussed with: alc Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D33947
# 0ce7909c	18-Jan-2022	Doug Moore <dougm@FreeBSD.org>	vm_phys: add essential segment bounds check A lower-bound segment check is necessary in vm_phys_alloc_seg_contig. Add one. Reported by: jenkins Reviewed by: alc Fixes: da92ecbc0d8f vm_phys: fix seg->end test in alloc_seg_contig MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D33945
# da92ecbc	17-Jan-2022	Doug Moore <dougm@FreeBSD.org>	vm_phys: fix seg->end test in alloc_seg_contig In vm_phys_alloc_seg_contig, in allocating multiple memory blocks for a huge allocation, ensure that the end of the allocated range does not exceed the upper segment limit. Reorder a couple of checks to improve code layout. Reviewed by: alc MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D33870
# e6930b1c	30-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_phys: convert error back to warning Move an assignment back to where it was before, to turn the defined-but-not-used error back into a set-but-not-used warning. Fixes: 01e115ab83a4 vm_phys: #include vm_extern
# 01e115ab	30-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_phys: #include vm_extern Arm64 and powerpc don't include vm_extern.h indirectly in vm_phys.c, which means that for the sake of those architectures, it must be included explicitly. Also, fix a set-unused warning that jenkins also found. Reported by: Jenkins Fixes: c606ab59e7f9 vm_extern: use standard address checkers everywhere
# c606ab59	30-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_extern: use standard address checkers everywhere Define simple functions for alignment and boundary checks and use them everywhere instead of having slightly different implementations scattered about. Define them in vm_extern.h and use them where possible where vm_extern.h is included. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D33685
# 8119cdd3	29-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_phys: hide vm_phys_set_pool It is only called in the file that defines it, so make it static and remove the declaration from the header. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D33688
# 31991a5a	29-Sep-2021	Mitchell Horne <mhorne@FreeBSD.org>	minidump: De-duplicate is_dumpable() The function is identical in each minidump implementation, so move it to vm_phys.c. The only slight exception is powerpc where the function was public, for use in moea64_scan_pmap(). Reviewed by: kib, markj, imp (earlier version) MFC after: 2 weeks Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D31884
# 431fb8ab	18-Nov-2020	Mark Johnston <markj@FreeBSD.org>	vm_phys: Try to clean up NUMA KPIs It can be useful for code outside the VM system to look up the NUMA domain of a page backing a virtual or physical address, specifically when creating NUMA-aware data structures. We have _vm_phys_domain() for this, but the leading underscore implies that it's an internal function, and vm_phys.h has dependencies on a number of other headers. Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to vm_phys_domain(). Make the latter an inline function. Add _vm_phys.h and define struct vm_phys_seg there so that it's easier to use in other headers. Include it from vm_page.h so that vm_page_domain() can be defined there. Include machine/vmparam.h from _vm_phys.h since it depends directly on some constants defined there. Reviewed by: alc Reviewed by: dougm, kib (earlier versions) Differential Revision: https://reviews.freebsd.org/D27207
# 114484b7	23-Sep-2020	Mark Johnston <markj@FreeBSD.org>	Flag vm_reserv and vm_phys sysctls as MPSAFE. Nothing in these subsystems relies on Giant. MFC after: 1 week
# 81302f1d	28-May-2020	Mark Johnston <markj@FreeBSD.org>	Fix boot on systems where NUMA domain 0 is unpopulated. - Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to ensure that segments registered during hammer_time() are placed in the right domain. Otherwise, since the SRAT is not parsed at that point, we just add them to domain 0, which may be incorrect and results in a domain with only several MB worth of memory. - Fix uma_startup1() to try allocating memory for zones from any domain. If domain 0 is unpopulated, the allocation will simply fail, resulting in a page fault slightly later during boot. - Change _vm_phys_domain() to return -1 for addresses not covered by the affinity table, and change vm_phys_early_alloc() to handle wildcard domains. This is necessary on amd64, where the page array is dense and pmap_page_array_startup() may allocate page table pages for non-existent page frames. Reported and tested by: Rafael Kitover <rkitover@gmail.com> Reviewed by: cem (earlier version), kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25001
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# b649c2ac	22-Dec-2019	Doug Moore <dougm@FreeBSD.org>	Fix typo using RB_INITIALIZER. The macro RB_INITIALIZER ignores its argument, but is documented to require "&head" as argument to initialize "head". So using "_vm_phys_fictitious_tree" as the argument to initialize "vm_phys_fictitious_tree" is an inconsequential error, corrected here. Discussed with: alc
# 3921068f	18-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Remove unnecessary debugging from r351181 that caused powerpc build to fail. Tested by: make universe TARGETS=powerpc
# be3f5f29	18-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	vm_phys_avail_find is only used on NUMA kernels. Fix a build error.
# b7565d44	18-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Encapsulate phys_avail manipulation in a set of simple routines. Add a NUMA aware boot time memory allocator that will be used to allocate early domain correct structures. Code partially submitted by gallatin. Reviewed by: gallatin, kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21251
# 21943937	15-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Move phys_avail definition into MI code. It is consumed in the MI layer and doing so adds more flexibility with less redundant code. Reviewed by: jhb, markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21250
# c1685086	06-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Add two new kernel options to control memory locality on NUMA hardware. - UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that was freed on a different domain from where it was allocated. This is only used for UMA_ZONE_NUMA (first-touch) zones. - UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all zones. This tries to maintain locality for kernel memory. Reviewed by: gallatin, alc, kib Tested by: pho, gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20929
# b8590dae	31-May-2019	Doug Moore <dougm@FreeBSD.org>	The function vm_phys_free_contig invokes vm_phys_free_pages for every power-of-two page block it frees, launching an unsuccessful search for a buddy to pair up with each time. The only possible buddy-up mergers are across the boundaries of the freed region, so change vm_phys_free_contig simply to enqueue the freed interior blocks, via a new function vm_phys_enqueue_contig, and then call vm_phys_free_pages on the bounding blocks to create as big a cross-boundary block as possible after buddy-merging. The only callers of vm_phys_free_contig at the moment call it in situations where merging blocks across the boundary is clearly impossible, so just call vm_phys_enqueue_contig in those places and avoid trying to buddy-up at all. One beneficiary of this change is in breaking reservations. For the case where memory is freed in breaking a reservation with only the first and last pages allocated, the number of cycles consumed by the operation drops about 11% with this change. Suggested by: alc Reviewed by: alc Approved by: kib, markj (mentors) Differential Revision: https://reviews.freebsd.org/D16901
# 75d6d576	27-Feb-2019	Mateusz Guzik <mjg@FreeBSD.org>	vm: remove seq.h inclusion made obsolete by NUMA rewrite Sponsored by: The FreeBSD Foundation
# f2a496d6	18-Jan-2019	Konstantin Belousov <kib@FreeBSD.org>	MI VM: Make it possible to set size of superpage at boot instead of compile time. In order to allow a single kernel to use PAE pagetables on i386 if hardware supports it, and fall back to classic two-level paging structures if not, superpage code should be able to adapt to either 2M or 4M superpage size. Here I make MI VM structures large enough to track the biggest possible superpage, by allowing the architecture to define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX constants. Corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER symbols can be defined as runtime values and must be less than the _MAX constants. If the architecture does not define _MAXs, it is assumed that _MAX == normal constant. Reviewed by: markj Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18853
# 87ab1a10	23-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Initialize static domainsets regardless of whether an SRAT is present. Reported by: yuripv X-MFC with: r339452 Sponsored by: The FreeBSD Foundation
# b61f3142	22-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Make it possible to disable NUMA support with a tunable. This provides a chicken switch for anyone negatively impacted by enabling NUMA in the amd64 GENERIC kernel configuration. With NUMA disabled at boot-time, information about the NUMA topology is not exposed to the rest of the kernel, and all of physical memory is viewed as coming from a single domain. This method still has some performance overhead relative to disabling NUMA support at compile time. PR: 231460 Reviewed by: alc, gallatin, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17439
# 662e7fa8	20-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Create some global domainsets and refactor NUMA registration. Pre-defined policies are useful when integrating the domainset(9) policy machinery into various kernel memory allocators. The refactoring will make it easier to add NUMA support for other architectures. No functional change intended. Reviewed by: alc, gallatin, jeff, kib Tested by: pho (part of a larger patch) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17416
# 463406ac	24-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Add more NUMA-specific low memory predicates. Use these predicates instead of inline references to vm_min_domains. Also add a global all_domains set, akin to all_cpus. Reviewed by: alc, jeff, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17278
# 72aebdd7	02-Sep-2018	Alan Cox <alc@FreeBSD.org>	Recent changes have created, for the first time, physical memory segments that can be coalesced. To be clear, fragmentation of phys_avail[] is not the cause. This fragmentation of vm_phys_segs[] arises from the "special" calls to vm_phys_add_seg(), in other words, not those that derive directly from phys_avail[], but those that we create for the initial kernel page table pages and now for the kernel and modules loaded at boot time. Since we sometimes iterate over the physical memory segments, coalescing these segments at initialization time is a worthwhile change. Reviewed by: kib, markj Approved by: re (rgrimes) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16976
# 67d33338	27-Jul-2018	Warner Losh <imp@FreeBSD.org>	Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM. There's no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM except for the default boundary (16MB on x86 and 256MB on MIPS, but they are otherwise the same). We don't need both for any system we support (there were some really old ARC systems that did have ISA/EISA bus, but we never ran on them and they are too old to ever grow support for). Differential Review: https://reviews.freebsd.org/D16290
# 370a338a	04-Jul-2018	Alan Cox <alc@FreeBSD.org>	Allow callers to vm_phys_split_pages() to specify whether insertion should occur at the head or the tail of the page queues.
# 7493904e	02-Jul-2018	Alan Cox <alc@FreeBSD.org>	Introduce vm_phys_enq_range(), and call it in vm_phys_alloc_npages() and vm_phys_alloc_seg_contig() instead of vm_phys_free_contig(). In short, vm_phys_enq_range() is simpler and faster than the more general vm_phys_free_contig(), and in the case of vm_phys_alloc_seg_contig(), vm_phys_free_contig() was placing the excess physical pages at the wrong end of the queues. In collaboration with: Doug Moore <dougm@rice.edu>
# 9161b4de	28-Jun-2018	Alan Cox <alc@FreeBSD.org>	Three changes to vm_phys_alloc_seg_contig(): 1. Optimize the order computation. 2. Update the pool for all of the chunks that are removed from the free page lists, and not just the first chunk. 3. Simplify the code for returning excess pages to the free page lists. Reviewed by: Doug Moore <dougm@rice.edu>
# 32d81f21	28-Jun-2018	Alan Cox <alc@FreeBSD.org>	Reflow one of the comments describing vm_phys_alloc_npages().
# 89ea39a7	26-Jun-2018	Alan Cox <alc@FreeBSD.org>	Update the physical page selection strategy used by vm_page_import() so that it does not cause rapid fragmentation of the free physical memory. Reviewed by: jeff, markj (an earlier version) Differential Revision: https://reviews.freebsd.org/D15976
# 5cd29d0f	24-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893
# c33e3a64	31-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Add a uma cache of free pages in the DEFAULT freepool. This gives us per-cpu alloc and free of pages. The cache is filled with as few trips to the phys allocator as possible by the use of a new vm_phys_alloc_npages() function which allocates as many as N pages. This code was originally by markj with the import function rewritten by me. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14905
# cdfeced8	22-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Use read_mostly and alignment tags to eliminate or limit false sharing. Reviewed by: markj (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon
# 79e9552e	20-Mar-2018	Konstantin Belousov <kib@FreeBSD.org>	Check for wrap-around in vm_phys_alloc_seg_contig(). It is possible to provide insane values for size in a contigmalloc(9) request, which usually do not reach the phys allocator due to failing KVA allocation. But with the forthcoming 4/4 i386, where a 32bit architecture has almost 4G KVA, contigmalloc(1G) is not unreasonable outright and KVA might be available sometimes. Then, the calculation of pa_end could wrap around, depending on the physical address, and the checks in vm_phys_alloc_seg_contig() would pass while the iteration in the loop after the 'done' label goes out of the vm_page_array bounds. Fix it by detecting the wrap. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14767
# e2068d0b	06-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000
# b6715dab	13-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb
# 6f4acaf4	12-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545
# 3f289c3f	12-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Clean up some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403
# 5be93778	04-Dec-2017	Andrew Turner <andrew@FreeBSD.org>	Print the correct value when freelist is out of range. Security: : Sponsored by: DARPA, AFRL
# 0db2102a	04-Dec-2017	Michael Zhilin <mizhka@FreeBSD.org>	[mips] [vm] restore translation of freelist to flind for page allocation Commit r326346 moved domain iterators from the physical layer to the vm_page one, but it also removed translation of freelist to flind for the vm_page_alloc_freelist() call. Before, it expected a VM_FREELIST_ parameter, but after, it expects a freelist index. On small WiFi boxes with few megabytes of RAM, there is only one freelist VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file sys/mips/include/vmparam.h). It results in freelist 1 with flind 0. At first, this commit renames flind to freelist in vm_page_alloc_freelist to avoid misunderstanding about input parameters. Then on the physical layer it restores translation for correct handling of the freelist parameter. Reported by: landonf Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D13351
# ef435ae7	28-Nov-2017	Jeff Roberson <jeff@FreeBSD.org>	Move domain iterators into the page layer where domain selection should take place. This makes the majority of the phys layer explicitly domain specific. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix & Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13014
# d2b677ce	27-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Avoid unnecessary lookups when initializing the vm_page array. This gives a marginal improvement in the vm_page_array initialization time. Also garbage-collect the now-unused vm_phys_paddr_to_segind(). Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13270
# fe267a55	27-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. No functional change intended.
# b20bf182	26-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Move vm_phys_init_page() to vm_page.c. Suggested by: kib Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13250
# 830cb6b2	26-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Remove unneeded initializations from vm_phys_init_page(). The page allocator always initializes the aflags and oflags fields. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13242
# f93f7cf1	07-Sep-2017	Mark Johnston <markj@FreeBSD.org>	Speed up vm_page_array initialization. We currently initialize the vm_page array in three passes: one to zero the array, one to initialize the "order" field of each page (necessary when inserting them into the vm_phys buddy allocator one-by-one), and one to initialize the remaining non-zero fields and individually insert each page into the allocator. Merge the three passes into one following a suggestion from alc: initialize vm_page fields in a single pass, and use vm_phys_free_contig() to efficiently insert physical memory segments into the buddy allocator. This reduces the initialization time to a third or a quarter of what it was before on most systems that I tested. Reviewed by: alc, kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D12248
# a6b15641	02-Feb-2017	Edward Tomasz Napierala <trasz@FreeBSD.org>	Ifdef out the unused vm_rr_selectdomain(). MFC after: 2 weeks Sponsored by: DARPA, AFRL
# dbbaf04f	03-Sep-2016	Mark Johnston <markj@FreeBSD.org>	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714
# fc85a6f0	13-Aug-2016	Mark Johnston <markj@FreeBSD.org>	Initialize page busy lock state in vm_phys_add_page(). MFC after: 1 week
# d9c9c81c	21-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.
# 62d70a81	09-Apr-2016	John Baldwin <jhb@FreeBSD.org>	Add more fine-grained kernel options for NUMA support. VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in the virtual memory system. DEVICE_NUMA is used to enable affinity reporting for devices such as bus_get_domain(). MAXMEMDOM must still be set to a value greater than 1 for any NUMA support to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is enabled and the system supports NUMA. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5782
# 477bffbe	15-Jan-2016	Alan Cox <alc@FreeBSD.org>	A fix to r292469: Iterate over the physical segments in descending rather than ascending order in vm_phys_alloc_contig() so that, for example, a sequence of contigmalloc(low=0, high=4GB) calls doesn't exhaust the supply of low physical memory resulting in a later contigmalloc(low=0, high=1MB) failure. Reported by: cy Tested by: cy Sponsored by: EMC / Isilon Storage Division
# c869e672	19-Dec-2015	Alan Cox <alc@FreeBSD.org>	Introduce a new mechanism for relocating virtual pages to a new physical address and use this mechanism when: 1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical memory allocator's free page lists. This replaces the long-standing approach of scanning the inactive and active queues, converting clean pages into PG_CACHED pages and laundering dirty pages. In contrast, the new mechanism does not use PG_CACHED pages nor does it trigger a large number of I/O operations. 2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find free pages in the physical memory allocator's free page lists that are covered by the direct map. Tested by: adrian 3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable free pages in the physical memory allocator's free page lists. In the coming months, I expect that this new mechanism will be applied in other places. For example, balloon drivers should use relocation to minimize fragmentation of the guest physical address space. Make vm_phys_alloc_contig() a little smarter (and more efficient in some cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free page lists that can't possibly contain suitable pages. Reviewed by: kib, markj Glanced at: jhb Discussed with: jeff Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4444
# 6520495a	11-Jul-2015	Adrian Chadd <adrian@FreeBSD.org>	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policies. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. I.e., if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development!
Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell
# 415d7cca	07-May-2015	Adrian Chadd <adrian@FreeBSD.org>	Add initial memory locality cost awareness to the VM, and include a basic ACPI SLIT table parser. For now this just exports the map via sysctl; it'll eventually be useful to userland when there's more useful NUMA support in -HEAD. * Add an optional mem_locality map; * add a mapping function taking from/to domain and returning the relative cost, or -1 if it's not available; * Add a very basic SLIT parser to x86 ACPI. Differential Revision: https://reviews.freebsd.org/D2460 Reviewed by: rpaulo, stas, jhb Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
# ed9dd64b	14-Mar-2015	Ian Lepore <ian@FreeBSD.org>	Revert r279932; this is going to be fixed in the sbuf code instead. PR: 195668
# f3b9fcf2	12-Mar-2015	Ian Lepore <ian@FreeBSD.org>	Nullterminate strings returned via sysctl. PR: 195668
# d866a563	30-Dec-2014	Alan Cox <alc@FreeBSD.org>	The physical memory allocator supports the use of distinct free lists for managing pages from different address ranges. Generally speaking, this feature is used to increase the likelihood that physical pages are available that can meet special DMA requirements or can be accessed through a limited-coverage direct mapping (e.g., MIPS). However, prior to this change, the configuration of the free lists was static, i.e., it was determined at compile time. Consequently, free lists could be created for address ranges that held no actual pages, for example, on 32-bit MIPS-based systems with 512 MB or less of physical memory. This change makes the creation of the free lists dynamic, i.e., it is based on the available physical memory at boot time. On 64-bit x86-based systems with 64 GB or more of physical memory, create free lists for managing pages with physical addresses below 4 GB. This change is to address reported problems with initializing devices that require the allocation of physical pages below 4 GB on some systems with 128 GB or more of physical memory. PR: 185727 Differential Revision: https://reviews.freebsd.org/D1274 Reviewed by: jhb, kib MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division
# 271f0f12	15-Nov-2014	Alan Cox <alc@FreeBSD.org>	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division
# 5ebe728d	05-Aug-2014	Roger Pau Monné <royger@FreeBSD.org>	vm_phys: improve robustness of fictitious ranges With the current implementation of managed fictitious ranges when also using VM_PHYSSEG_DENSE, a user could try to register a fictitious range that starts inside of vm_page_array, but then overruns it (because the end of the fictitious range is greater than vm_page_array_size + first_page). This would result in PHYS_TO_VM_PAGE returning unallocated pages from past the end of vm_page_array. The same could happen if a user tried to register a segment that starts outside of vm_page_array but ends inside of it. In order to fix this, allow vm_phys_fictitious_{reg/unreg}_range to use a set of pages from vm_page_array, and allocate the rest. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc vm/vm_phys.c: - Allow registering/unregistering fictitious ranges that overrun vm_page_array.
# 38d6b2dc	09-Jul-2014	Roger Pau Monné <royger@FreeBSD.org>	vm_phys: remove limitation on number of fictitious regions The number of vm fictitious regions was limited to 8 by default, but Xen will make heavy use of those kinds of regions in order to map memory from foreign domains, so instead of increasing the default number, change the implementation to use a red-black tree to track vm fictitious ranges. The public interface remains the same. Sponsored by: Citrix Systems R&D Reviewed by: kib, alc Approved by: gibbs vm/vm_phys.c: - Replace the vm fictitious static array with a red-black tree. - Use a rwlock instead of a mutex, since now we also need to take the lock in vm_phys_fictitious_to_vm_page, and it can be shared.
# a17937bd	29-Apr-2014	Konstantin Belousov <kib@FreeBSD.org>	For the VM_PHYSSEG_DENSE case, checking the requested range to fall into the area backed by vm_page_array wrongly compared end with vm_page_array_size. It should be adjusted by first_page index to be correct. Also, the corner and incorrect case of the requested range extending after the end of the vm_page_array was incorrectly handled by allocating the segment. Fix the comparison for the end of range and return EINVAL if the end extends beyond vm_page_array. Discussed with: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 44f1c916	22-Mar-2014	Bryan Drewery <bdrewery@FreeBSD.org>	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff, the struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division
# 000fb817	31-Dec-2013	Alan Cox <alc@FreeBSD.org>	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division
# eb2f42fb	10-Oct-2013	Alan Cox <alc@FreeBSD.org>	Tidy up the output of "sysctl vm.phys_free". Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division
# c325e866	10-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# c7aebda8	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 449c2e92	07-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Split the pagequeues per NUMA domains, and split pagedaemon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread's inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# c0432fc3	06-Aug-2013	Mark Johnston <markj@FreeBSD.org>	Fill in the description fields for M_FICT_PAGES. Reviewed by: kib MFC after: 3 days
# 6b5fbc12	03-Jul-2013	Neel Natu <neel@FreeBSD.org>	vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be reset by pmap_page_init() right after being initialized in vm_page_initfake(). The statement above is with reference to the amd64 implementation of pmap_page_init(). Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing the 'memattr'. Reviewed by: kib MFC after: 2 weeks
# 7e226537	13-May-2013	Attilio Rao <attilio@FreeBSD.org>	o Add accessor functions to add and remove pages from a specific freelist. o Split the pool of free pages queues really by domain and not rely on definition of VM_RAW_NFREELIST. o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific function that is called when calculating the allocation domain. The RR counter is kept, currently, per-thread. In the future it is expected that such function evolves in a real policy decision referee, based on specific information retrieved by per-thread and per-vm_object attributes. o Add the concept of "probed domains" under the form of vm_ndomains. It is the responsibility of every architecture willing to support multiple memory domains to correctly probe vm_ndomains along with mem_affinity segments attributes. Those two values are supposed to remain always consistent. Please also note that vm_ndomains and td_dom_rr_idx are both int because segments already store domains as int. Ideally u_int would make much more sense. Probably this should be cleaned up in the future. o Apply RR domain selection also to vm_phys_zero_pages_idle(). Sponsored by: EMC / Isilon storage division Partly obtained from: jeff Reviewed by: alc Tested by: jeff
# d0b5855e	08-May-2013	Attilio Rao <attilio@FreeBSD.org>	Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of MAXMEMDOM. This unbreaks builds. Sponsored by: EMC / Isilon storage division Reported by: adrian, jeli
# 941646f5	07-May-2013	Attilio Rao <attilio@FreeBSD.org>	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc
# f5c4b077	03-May-2013	John Baldwin <jhb@FreeBSD.org>	Fix two bugs in the current NUMA-aware allocation code: - vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT. - vm_phys_alloc_pages() did not pin the thread and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available resulting in a panic in vm_page_alloc(). Reviewed by: alc MFC after: 1 week
# 174b5f38	14-Feb-2013	John Baldwin <jhb@FreeBSD.org>	Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel config file. Requested by: phk (ages ago) MFC after: 1 month
# b6de32bd	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a facility to register a range of physical addresses to be used for allocation of fictitious pages, for which PHYS_TO_VM_PAGE() returns proper fictitious vm_page_t. The range should be de-registered after the consumer has stopped using it. De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate over registered ranges. A hash container might be developed instead of range registration interface, and fake pages could be put automatically into the hash, where PHYS_TO_VM_PAGE() could look them up later. This should be considered before the MFC of the commit is done. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# d6e9b97b	19-Mar-2012	John Baldwin <jhb@FreeBSD.org>	Bah, just revert my earlier change entirely. (Missed alc's request to do this earlier.) Requested by: alc
# 8407f696	19-Mar-2012	John Baldwin <jhb@FreeBSD.org>	Alter the previous commit to use vm_size_t instead of vm_pindex_t. vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t, but a page index instead of a byte offset.
# df96bc97	14-Mar-2012	John Baldwin <jhb@FreeBSD.org>	Pedantic nit: use vm_pindex_t instead of long for a count of pages.
# fbd80bd0	16-Nov-2011	Alan Cox <alc@FreeBSD.org>	Refactor the code that performs physically contiguous memory allocation, yielding a new public interface, vm_page_alloc_contig(). This new function addresses some of the limitations of the current interfaces, contigmalloc() and kmem_alloc_contig(). For example, the physically contiguous memory that is allocated with those interfaces can only be allocated to the kernel vm object and must be mapped into the kernel virtual address space. It also provides functionality that vm_phys_alloc_contig() doesn't, such as wiring the returned pages. Moreover, unlike that function, it respects the low water marks on the paging queues and wakes up the page daemon when necessary. That said, at present, this new function can't be applied to all types of vm objects. However, that restriction will be eliminated in the coming weeks. From a design standpoint, this change also addresses an inconsistency between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions. Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew about vnodes and reservations. Now, vm_page_alloc_contig() is responsible for these things. Reviewed by: kib Discussed with: jhb
# 5c1f2cc4	29-Oct-2011	Alan Cox <alc@FreeBSD.org>	Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at eliminating duplicated code in the various pmap implementations. Micro-optimize vm_phys_free_pages(). Introduce vm_phys_free_contig(). It is a fast routine for freeing an arbitrary number of physically contiguous pages. In particular, it doesn't require the number of pages to be a power of two. Use "u_long" instead of "unsigned long". Bruce Evans (bde@) has convinced me that the "boundary" parameters to kmem_alloc_contig(), vm_phys_alloc_contig(), and vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not "u_long". Make this change.
# 2d510660	22-Oct-2011	Attilio Rao <attilio@FreeBSD.org>	VM_NRESERVLEVEL is used in this file but opt_vm.h is not included, thus the stub switch won't be correctly handled. Include opt_vm.h. Submitted by: jeff MFC after: 3 days
# 00f0e671	26-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	Explicitly wire the user buffer rather than doing it implicitly in sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT drain for extremely large amounts of data where the caller knows that appropriate references are held, and sleeping is not an issue. Inspired by: rwatson
# 44e46b9e	17-Jan-2011	Alan Cox <alc@FreeBSD.org>	Explicitly initialize the page's queue field to PQ_NONE instead of relying on PQ_NONE being zero. Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for PQ_NONE. Reviewed by: kib@
# d689bc00	30-Oct-2010	Alan Cox <alc@FreeBSD.org>	Correct some format strings used by sysctls. MFC after: 1 week
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 4e657159	16-Sep-2010	Matthew D Fleming <mdf@FreeBSD.org>	Re-add r212370 now that the LOR in powerpc64 has been resolved: Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough SBUF_FIXEDLEN buffer. Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary. Reviewed by: phk (original patch)
# 404a593e	13-Sep-2010	Matthew D Fleming <mdf@FreeBSD.org>	Revert r212370, as it causes a LOR on powerpc. powerpc does a few unexpected things in copyout(9) and so wiring the user buffer is not sufficient to perform a copyout(9) while holding a random mutex. Requested by: nwhitehorn
# dd67e210	09-Sep-2010	Matthew D Fleming <mdf@FreeBSD.org>	Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough FIXEDLEN buffer. Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary. Reviewed by: phk
# a3870a18	27-Jul-2010	John Baldwin <jhb@FreeBSD.org>	Very rough first cut at NUMA support for the physical page allocator. For now it uses a very dumb first-touch allocation policy. This will change in the future. - Each architecture indicates the maximum number of supported memory domains via a new VM_NDOMAIN parameter in <machine/vmparam.h>. - Each cpu now has a PCPU_GET(domain) member to indicate the memory domain a CPU belongs to. Domain values are dense and numbered from 0. - When a platform supports multiple domains, the default freelist (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain. The MD code is required to populate an array of mem_affinity structures. Each entry in the array defines a range of memory (start and end) and a domain for the range. Multiple entries may be present for a single domain. The list is terminated by an entry where all fields are zero. This array of structures is used to split up phys_avail[] regions that fall in VM_FREELIST_DEFAULT into per-domain freelists. - Each memory domain has a separate lookup-array of freelists that is used when fulfilling a physical memory allocation. Right now the per-domain freelists are listed in a round-robin order for each domain. In the future a table such as the ACPI SLIT table may be used to order the per-domain lookup lists based on the penalty for each memory domain relative to a specific domain. The lookup lists may be examined via a new vm.phys.lookup_lists sysctl. - The first-touch policy is implemented by using PCPU_GET(domain) to pick a lookup list when allocating memory. Reviewed by: alc
# 49ca10d4	21-Jul-2010	Jayachandran C. <jchandra@FreeBSD.org>	Redo the page table page allocation on MIPS, as suggested by alc@. The UMA zone based allocation is replaced by a scheme that creates a new free page list for the KSEG0 region, and a new function in sys/vm that allocates pages from a specific free page list. This also fixes a race condition introduced by the UMA based page table page allocation code. Dropping the page queue and pmap locks before the call to uma_zfree, and re-acquiring them afterwards will introduce a race condition (noted by alc@). The changes are: - Revert the earlier changes in MIPS pmap.c that added UMA zone for page table pages. - Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that is not directly mapped (in 32bit kernel). Normal page allocations will first try the HIGHMEM freelist and then the default (direct mapped) freelist. - Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int order, int req)' to vm/vm_page.c to allocate a page from a specified freelist. The MIPS page table pages will be allocated using this function from the freelist containing direct mapped pages. - Move the page initialization code from vm_phys_alloc_contig() to a new function vm_page_alloc_init(), and use this function to initialize pages in vm_page_alloc_freelist() too. - Split the function vm_phys_alloc_pages(int pool, int order) to create vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages(). Reviewed by: alc
# 3153e878	12-Jul-2009	Alan Cox <alc@FreeBSD.org>	Add support to the virtual memory system for configuring machine-dependent memory attributes: Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the fact that there are machine-dependent memory attributes that have nothing to do with controlling the cache's behavior. Introduce vm_object_set_memattr() for setting the default memory attributes that will be given to an object's pages. Introduce and use pmap_page_{get,set}_memattr() for getting and setting a page's machine-dependent memory attributes. Add full support for these functions on amd64 and i386 and stubs for them on the other architectures. The function pmap_page_set_memattr() is also responsible for any other machine-dependent aspects of changing a page's memory attributes, such as flushing the cache or updating the direct map. The uses include kmem_alloc_contig(), vm_page_alloc(), and the device pager: kmem_alloc_contig() can now be used to allocate kernel memory with non-default memory attributes on amd64 and i386. vm_page_alloc() and the device pager will set the memory attributes for the real or fictitious page according to the object's default memory attributes. Update the various pmap functions on amd64 and i386 that map pages to incorporate each page's memory attributes in the mapping. Notes: (1) Inherent to this design are safety features that prevent the specification of inconsistent memory attributes by different mappings on amd64 and i386. In addition, the device pager provides a warning when a device driver creates a fictitious page with memory attributes that are inconsistent with the real page that the fictitious page is an alias for. (2) Storing the machine-dependent memory attributes for amd64 and i386 as a dedicated "int" in "struct md_page" represents a compromise between space efficiency and the ease of MFCing these changes to RELENG_7. In collaboration with: jhb Approved by: re (kib)
# e999111a	25-Jun-2009	Alan Cox <alc@FreeBSD.org>	This change is the next step in implementing the cache control functionality required by video card drivers. Specifically, this change introduces vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all architectures. In addition, this change adds a vm_cache_mode_t parameter to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the interfaces for allocating mapped kernel memory and physical memory, respectively, with non-default cache modes. In collaboration with: jhb
# ef327c3e	21-Jun-2009	Alan Cox <alc@FreeBSD.org>	Implement a mechanism within vm_phys_alloc_contig() to defer all necessary calls to vdrop() until after the free page queues lock is released. This eliminates repeatedly releasing and reacquiring the free page queues lock each time the last cached page is reclaimed from a vnode-backed object.
# 6f0489c6	20-Jun-2009	Alan Cox <alc@FreeBSD.org>	Strive for greater consistency among the places that implement real, fictitious, and contiguous page allocation. Eliminate unnecessary reinitialization of a page's fields.
# f06a3a36	18-Jun-2009	Andrew Thompson <thompsa@FreeBSD.org>	Track the kernel mapping of a physical page by a new entry in vm_page structure. When the page is shared, the kernel mapping becomes a special type of managed page to force the cache off the page mappings. This is needed to avoid stale entries on all ARM VIVT caches, and VIPT caches with cache color issue. Submitted by: Mark Tinguely Reviewed by: alc Tested by: Grzegorz Bernacki, thompsa
# ead1d027	16-Jun-2009	Alan Cox <alc@FreeBSD.org>	Make the maintenance of a page's valid bits by contigmalloc() more like kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the page's valid bits until contigmapping() when the mapping is known to be successful.
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 44aab2c3	06-Apr-2008	Alan Cox <alc@FreeBSD.org>	Introduce vm_reserv_reclaim_contig(). This function is used by contigmalloc(9) as a last resort to steal pages from an inactive, partially-used superpage reservation. Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and refactor it so that a separate subroutine is responsible for breaking the selected reservation. This subroutine is also used by vm_reserv_reclaim_contig().
# 2fbced65	04-Apr-2008	Alan Cox <alc@FreeBSD.org>	Eliminate an unnecessary test from vm_phys_unfree_page().
# 9742373a	20-Dec-2007	Alan Cox <alc@FreeBSD.org>	Update the comment describing vm_phys_unfree_page().
# e35395ce	20-Dec-2007	Alan Cox <alc@FreeBSD.org>	Modify vm_phys_unfree_page() so that it no longer requires the given page to be in the free lists. Instead, it now returns TRUE if it removed the page from the free lists and FALSE if the page was not in the free lists. This change is required to support superpage reservations. Specifically, once reservations are introduced, a cached page can either be in the free lists or a reservation.
# bc8794a1	19-Dec-2007	Alan Cox <alc@FreeBSD.org>	Correct one half of a loop continuation condition in vm_phys_unfree_page(). At present, this error is inconsequential; the other half of the loop continuation condition is sufficient to achieve correct execution.
# 7bfda801	25-Sep-2007	Alan Cox <alc@FreeBSD.org>	Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. 
Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
# 8941dc44	14-Jul-2007	Alan Cox <alc@FreeBSD.org>	Eliminate two unused functions: vm_phys_alloc_pages() and vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to vm_phys_alloc_pages() and vm_phys_free_pages_locked() to vm_phys_free_pages(). Add comments regarding the need for the free page queues lock to be held by callers to these functions. No functional changes. Approved by: re (hrs)
# 2f9f48d6	15-Jun-2007	Alan Cox <alc@FreeBSD.org>	Update a comment.
# 11752d88	09-Jun-2007	Alan Cox <alc@FreeBSD.org>	Add a new physical memory allocator. However, do not yet connect it to the build. This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...). The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld. This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored. Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map. Approved by: re
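
The "binary buddy system" named in the entry above rests on one piece of address arithmetic: a free block of 2^order pages has exactly one buddy of the same size, found by flipping a single bit of its physical address. The following is a minimal sketch of that arithmetic only, not the code in vm_phys.c. The helper names buddy_of() and merged_start() are hypothetical, and EX_PAGE_SHIFT assumes 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define EX_PAGE_SHIFT 12

/*
 * Hypothetical helper: physical address of the buddy of the
 * 2^order-page block that starts at pa. The two buddies differ
 * only in the bit that selects which half of the parent block
 * they occupy.
 */
static uint64_t
buddy_of(uint64_t pa, int order)
{
	return (pa ^ ((uint64_t)1 << (EX_PAGE_SHIFT + order)));
}

/*
 * Hypothetical helper: start address of the 2^(order+1)-page
 * block formed when a block at pa and its buddy are both free
 * and get coalesced.
 */
static uint64_t
merged_start(uint64_t pa, int order)
{
	return (pa & ~(((uint64_t)1 << (EX_PAGE_SHIFT + order + 1)) - 1));
}
```

For example, under these assumptions the order-0 page at 0x5000 has its buddy at 0x4000, and freeing both would yield an order-1 block starting at 0x4000. Repeating the same step at successively higher orders is how small free blocks can be merged back into larger contiguous ones.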