Cross Reference: /freebsd-current/sys/vm/vm

History log of /freebsd-current/sys/vm/vm_pageout.c
Revision	Date	Author	Comments
# a216e311	24-May-2024	Ryan Libby <rlibby@FreeBSD.org>	vm_pageout_scan_inactive: take a lock break In vm_pageout_scan_inactive, release the object lock when we go to refill the scan batch queue so that someone else has a chance to acquire it. This improves access latency to the object when the pagedaemon is processing many consecutive pages from a single object, and also in any case avoids a hiccup during refill for the last touched object. Reviewed by: alc, markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D45288
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 1cac76c9	14-Dec-2022	Andrew Gallatin <gallatin@FreeBSD.org>	vm: reduce lock contention when processing vm batchqueues Rather than waiting until the batchqueue is full to acquire the lock & process the queue, we now start trying to acquire the lock using trylocks when the batchqueue is 1/2 full. This removes almost all contention on the vm pagequeue mutex for for our busy sendfile() based web workload. It also greadly reduces the amount of time a network driver ithread remains blocked on a mutex, and eliminates some packet drops under heavy load. So that the system does not loose the benefit of processing large batchqueues, I've doubled the size of the batchqueues. This way, when there is no contention, we process the same batch size as before. This has been run for several months on a busy Netflix server, as well as on my personal desktop. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37305
# 0cb2610e	16-Jul-2022	Mark Johnston <markj@FreeBSD.org>	vm: Remove handling for OBJT_DEFAULT objects Now that OBJT_DEFAULT objects can't be instantiated, we can simplify checks of the form object->type == OBJT_DEFAULT \|\| (object->flags & OBJ_SWAP) != 0. No functional change intended. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35788
# e123264e	19-Jun-2022	Mark Johnston <markj@FreeBSD.org>	vm: Fix racy checks for swap objects Commit 4b8365d752ef introduced the ability to dynamically register VM object types, for use by tmpfs, which creates swap-backed objects. As a part of this, checks for such objects changed from object->type == OBJT_DEFAULT \|\| object->type == OBJT_SWAP to object->type == OBJT_DEFAULT \|\| (object->flags & OBJ_SWAP) != 0 In particular, objects of type OBJT_DEFAULT do not have OBJ_SWAP set; the swap pager sets this flag when converting from OBJT_DEFAULT to OBJT_SWAP. A few of these checks are done without the object lock held. It turns out that this can result in false negatives since the swap pager converts objects like so: object->type = OBJT_SWAP; object->flags \|= OBJ_SWAP; Fix the problem by adding explicit tests for OBJT_SWAP objects in unlocked checks. PR: 258932 Fixes: 4b8365d752ef ("Add OBJT_SWAP_TMPFS pager") Reported by: bdrewery Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35470
# b51927b7	10-Feb-2022	Konstantin Belousov <kib@FreeBSD.org>	Revert "vm_pageout_scans: correct detection of active object" This reverts commit 3de96d664aaaf8e3fb1ca4fc4bd864d2cf734b24. Problem is that it is possible to reach the state with ref_count == 1 for the mapped non-anonymous object. For instance, anonymous posix shmfd or linux shmfs object could be mapped, and then corresponding file descriptor closed, dropping the object reference owned by the shmfd/shmfs file. Then the check in inactive scan assumes that the object and page are not mapped and frees the page, while they are not. PR: 261707 Discussed with: markj Sponsored by: The FreeBSD Foundation MFC after: now
# 3de96d66	16-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	vm_pageout_scans: correct detection of active object For non-anonymous swap objects, there is always a reference from the owner to the object to keep it from recycling. Account for it when deciding should we query pmap for hardware active references for the page. As result, we avoid unneeded calls to pmap_ts_referenced(), which for non-mapped page means avoiding unneccessary lock and unlock of the pv list. Reviewed by: markj Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33924
# 4a864f62	14-Jan-2022	Mark Johnston <markj@FreeBSD.org>	vm_pageout: Print a more accurate message to the console before an OOM kill Previously we'd always print "out of swap space." This can be misleading, as there are other reasons an OOM kill can be triggered. In particular, it's entirely possible to trigger an OOM kill on a system with plenty of free swap space. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33810
# c4a25e07	11-Jan-2022	Mark Johnston <markj@FreeBSD.org>	vm_pageout: Group sysctl variables together with sysctl definitions Fix some style bugs while here. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33811
# fa7a635f	14-Aug-2021	Gordon Bergling <gbe@FreeBSD.org>	Fix a few typos in source code comments - s/becase/because/ MFC after: 5 days
# 0ef5eee9	03-Aug-2021	Konstantin Belousov <kib@FreeBSD.org>	Add vn_lktype_write() and remove repetetive code that calculates vnode locking type for write. Reviewed by: khng, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31405
# 7079449b	07-May-2021	Konstantin Belousov <kib@FreeBSD.org>	sys/vm: remove several other uses of OBJT_SWAP_TMPFS Mostly in cases where OBJ_SWAP flag works as well, or by reversing the condition so that object types can be listed. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30168
# 4b8365d7	30-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	Add OBJT_SWAP_TMPFS pager This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager and generic vm code have to explicitly handle swap objects which are tmpfs vnode v_object, in the special ways. Replace (almost) all such places with proper methods. Since VM still needs a notion of the 'swap object', regardless of its use, add yet another type-classification flag OBJ_SWAP. Set it in vm_object_allocate() where other type-class flags are set. This change almost completely eliminates the knowledge of tmpfs from VM, and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30070
# 2913cc46	02-Oct-2020	Mark Johnston <markj@FreeBSD.org>	vm_pageout: Avoid rounding down the inactive scan target With helper page daemon threads, enabled by default in r364786, we divide the inactive target by the number of threads, rounding down, and sum the total number of pages freed by the threads. This sum is compared with the original target, but by rounding down we might lose pages, causing the page daemon control loop to conclude that inactive queue scanning isn't keeping up with demand for free pages. Typically this results in excessive swapping. Fix the problem by accounting for the error in the main pagedaemon thread's target. Note that by default the problem will manifest only in systems with >16 CPUs in a NUMA domain. Reviewed by: cem Discussed with: dougm Reported and tested by: dhw, glebius Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26610
# 97458520	17-Sep-2020	Mark Johnston <markj@FreeBSD.org>	Increase the default vm.max_user_wired value. Since r347532 (merged to stable/12) we only count user-wired pages towards the system limit. However, we now also treat pages wired by hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with large amounts of RAM tends to fail due to the low limit. The purpose of the limit is to provide a seatbelt, not to impose some policy on the use of wired memory. Thus, increase the default limit to allow reasonable VM configurations to work without tuning. Reviewed by: kib Discussed with: dougm MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26424
# 609de97e	28-Aug-2020	Eric van Gyzen <vangyzen@FreeBSD.org>	vm_pageout_scan_active: ensure ps_delta is initialized Reported by: Coverity Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26212
# 74f5530d	25-Aug-2020	Conrad Meyer <cem@FreeBSD.org>	vm_pageout: Scale worker threads with CPUs Autoscale vm_pageout worker threads from r364129 with CPU count. The default is arbitrarily chosen to be 16 CPUs per worker thread, but can be adjusted with the vm.pageout_cpus_per_thread tunable. There will never be less than 1 thread per populated NUMA domain, and the previous arbitrary upper limit (at most ncpus/2 threads per NUMA domain) is preserved. Care is taken to gracefully handle asymmetric NUMA nodes, such as empty node systems (e.g., AMD 2990WX) and systems with nodes of varying size (e.g., some larger >20 core Intel Haswell/Broadwell Xeon). Reviewed by: kib, markj Sponsored by: Isilon Differential Revision: https://reviews.freebsd.org/D26152
# a92a971b	16-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)
# ea7b737a	14-Aug-2020	Conrad Meyer <cem@FreeBSD.org>	vm_pageout: Correct threshold calculation on single-CPU systems Reported by: Michael Butler X-MFC-With: r364129
# 0292c54b	11-Aug-2020	Conrad Meyer <cem@FreeBSD.org>	Add support for multithreading the inactive queue pageout within a domain. In very high throughput workloads, the inactive scan can become overwhelmed as you have many cores producing pages and a single core freeing. Since Mark's introduction of batched pagequeue operations, we can now run multiple inactive threads working on independent batches. To avoid confusing the pid and other control algorithms, I (Jeff) do this in a mpi-like fan out and collect model that is driven from the primary page daemon. It decides whether the shortfall can be overcome with a single thread and if not dispatches multiple threads and waits for their results. The heuristic is based on timing the pageout activity and averaging a pages-per-second variable which is exponentially decayed. This is visible in sysctl and may be interesting for other purposes. I (Jeff) have verified that this does indeed double our paging throughput when used with two threads. With four we tend to run into other contention problems. For now I would like to commit this infrastructure with only a single thread enabled. The number of worker threads per domain can be controlled with the 'vm.pageout_threads_per_domain' tunable. Submitted by: jeff (earlier version) Discussed with: markj Tested by: pho Sponsored by: probably Netflix (based on contemporary commits) Differential Revision: https://reviews.freebsd.org/D21629
# efec381d	04-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Remove most lingering references to the page lock in comments. Finish updating comments to reflect new locking protocols introduced over the past year. In particular, vm_page_lock is now effectively unused. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25868
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 23ed568c	14-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vm: remove no longer needed atomic_load_ptr casts
# 3c200db9	10-Feb-2020	Jonathan T. Looney <jtl@FreeBSD.org>	Modify the vm.panic_on_oom sysctl to take a count of events. Currently, the vm.panic_on_oom sysctl is a boolean which controls the behavior of the VM system when it encounters an out-of-memory situation. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will initiate a panic. This change makes the sysctl a count of events. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will kill the largest process until it has seen the specified number of out-of-memory events. Once it reaches the specified number of events, it will initiate a panic. This change is helpful in capturing cores when the system is in a perpetual cycle of out-of-memory events (as opposed to just hitting one or two sporadic out-of-memory events). Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23601
# ace409ce	13-Jan-2020	Alexander Motin <mav@FreeBSD.org>	Restore loop break in vm_pageout_lowmem(). r355004 removed return statement from this loop with intention to also call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes infinite loop. Reviewed by: markj Sponsored by: iXsystems, Inc.
# f7607c30	02-Jan-2020	Mark Johnston <markj@FreeBSD.org>	Clear queue operation flags when migrating a page to another queue. The page daemon loops may move pages back to the active queue if references are detected. In this case we must take care to clear existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be set, and that flag is only valid if the page belongs to the inactive queue. Also fix a bug in the active queue scan where we were updating "old" instead of "new". This would only have been hit in rare cases where the page moved out of the active queue after the beginning of the scan. Reported by: Bob Prohaska, Idwer Vollering Tested by: Idwer Vollering Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D23001
# 9f5632e6	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885
# b7f30bff	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882
# f3f38e25	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Start implementing queue state updates using fcmpset loops. This is in preparation for eliminating the use of the vm_page lock for protecting queue state operations. Introduce the vm_page_pqstate_commit_*() functions. These functions act as helpers around vm_page_astate_fcmpset() and are specialized for specific types of operations. vm_page_pqstate_commit() wraps these functions. Convert a number of routines to use these new helpers. Use vm_page_release_toq() in vm_page_unwire() and vm_page_release() to atomically release a wiring reference and release the page into a queue. This has the side effect that vm_page_unwire() will leave the page in the active queue if it is already present there. Convert the page queue scans to use the new helpers. Simplify vm_pageout_reinsert_inactive(), which requeues pages that were found to be busy during an inactive queue scan, to avoid duplicating the work of vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or PGA_REQUEUE_HEAD is set, let that be handled during batch processing. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22770 Differential Revision: https://reviews.freebsd.org/D22771 Differential Revision: https://reviews.freebsd.org/D22772 Differential Revision: https://reviews.freebsd.org/D22773 Differential Revision: https://reviews.freebsd.org/D22776
# a8081778	14-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654
# 5cff1f4d	10-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Introduce vm_page_astate. This is a 32-bit structure embedded in each vm_page, consisting mostly of page queue state. The use of a structure makes it easy to store a snapshot of a page's queue state in a stack variable and use cmpset loops to update that state without requiring the page lock. This change merely adds the structure and updates references to atomic state fields. No functional change intended. Reviewed by: alc, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22650
# 9c770a27	22-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Simplify vm_pageout_init_domain() and add a "big picture" comment. Stop subtracting 1024/200 from vmd_page_count/200. I cannot see how such precise accounting can make a difference on modern systems. Add some explanation of what the page daemon does and how it handles memory shortages. Reviewed by: dougm Discussed with: jeff, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22396
# 8fc25508	22-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Reclaim memory from UMA if the page daemon is struggling. Use the UMA reclaim thread to asynchronously drain all caches if there is a severe shortage in a domain. Otherwise we only trigger UMA reclamation every 10s even when the system has completely run out of memory. Stop entirely draining the caches when one domain falls below its min threshold. In some workloads it is normal for one NUMA domain to end up being nearly depleted by kernel memory allocations, for example for the ZFS ARC. The domainset iterators skip domains below the vmd_min_free theshold on the first iteration, so we should allow that mechanism to limit further depletion of the domain's free pages before taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU). Discussed with: jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22395
# 0012f373	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594
# 63e97555	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(1/6) Replace busy checks with acquires where it is trival to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548
# 2288078c	08-Oct-2019	Doug Moore <dougm@FreeBSD.org>	Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map. In case the implementation ever changes from using a chain of next pointers, then changing the macro definition will be necessary, but changing all the files that iterate over vm_map entries will not. Drop a counter in vm_object.c that would have an effect only if the vm_map entry count was wrong. Discussed with: alc Reviewed by: markj Tested by: pho (earlier version) Differential Revision: https://reviews.freebsd.org/D21882
# e8bcf696	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Revert r352406, which contained changes I didn't intend to commit.
# 41fd4b94	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Fix a couple of nits in r352110. - Remove a dead variable from the amd64 pmap_extract_and_hold(). - Fix grammar in the vm_page_wire man page. Reported by: alc Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639
# fee2a2fa	09-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486
# 7cdeaf33	03-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Add preliminary support for atomic updates of per-page queue state. Queue operations on a page use the page lock when updating the page to reflect the desired queue state, and the page queue lock when physically enqueuing or dequeuing a page. Multiple pages share a given page lock, but queue state is per-page; this false sharing results in heavy lock contention. Take a small step towards the use of atomic_cmpset to synchronize updates to per-page queue state by introducing vm_page_pqstate_cmpset() and using it in the page daemon. In the longer term the plan is to stop using the page lock to protect page identity and rely only on the object and page busy locks. However, since the page daemon avoids acquiring the object lock except when necessary, some synchronization with a concurrent free of the page is required. vm_page_pqstate_cmpset() can be used to ensure that queue state updates are successful only if the page is not scheduled for a dequeue, which is sufficient for the page daemon. Add vm_page_swapqueue(), which moves a page from one queue to another using vm_page_pqstate_cmpset(). Use it in the active queue scan, which does not use the object lock. Modify vm_page_dequeue_deferred() to use vm_page_pqstate_cmpset() as well. Reviewed by: kib Discussed with: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21257
# 08cfa56e	01-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667
# 245139c6	16-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	Fix OOM handling of some corner cases. In addition to pagedaemon initiating OOM, also do it from the vm_fault() internals. Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13671
# eeacb3b0	08-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247
# 0cab71bc	06-Jul-2019	Doug Moore <dougm@FreeBSD.org>	Fix style(9) violations involving division by PAGE_SIZE. Reviewed by: alc Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20847
# d70f0ab3	03-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Cache the next queue element when traversing a page queue. When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element invalidates its queue linkage pointers. vm_pageout_collect_batch() was relying on these pointers remaining valid after a removal, so modify it to fetch the next queued page before dequeuing the current page. Submitted by: Don Morris <dgmorris@earthlink.net> Reviewed by: cem, vangyzen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20842
# d842aa51	01-Jun-2019	Mark Johnston <markj@FreeBSD.org>	Add a vm_page_wired() predicate. Use it instead of accessing the wire_count field directly. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20485
# 54a3a114	13-May-2019	Mark Johnston <markj@FreeBSD.org>	Provide separate accounting for user-wired pages. Historically we have not distinguished between kernel wirings and user wirings for accounting purposes. User wirings (via mlock(2)) were subject to a global limit on the number of wired pages, so if large swaths of physical memory were wired by the kernel, as happens with the ZFS ARC among other things, the limit could be exceeded, causing user wirings to fail. The change adds a new counter, v_user_wire_count, which counts the number of virtual pages wired by user processes via mlock(2) and mlockall(2). Only user-wired pages are subject to the system-wide limit which helps provide some safety against deadlocks. In particular, while sources of kernel wirings typically support some backpressure mechanism, there is no way to reclaim user-wired pages shorting of killing the wiring process. The limit is exported as vm.max_user_wired, renamed from vm.max_wired, and changed from u_int to u_long. The choice to count virtual user-wired pages rather than physical pages was done for simplicity. There are mechanisms that can cause user-wired mappings to be destroyed while maintaining a wiring of the backing physical page; these make it difficult to accurately track user wirings at the physical page layer. The change also closes some holes which allowed user wirings to succeed even when they would cause the system limit to be exceeded. For instance, mmap() may now fail with ENOMEM in a process that has called mlockall(MCL_FUTURE) if the new mapping would cause the user wiring limit to be exceeded. Note that bhyve -S is subject to the user wiring limit, which defaults to 1/3 of physical RAM. Users that wish to exceed the limit must tune vm.max_user_wired. Reviewed by: kib, ngie (mlock() test changes) Tested by: pho (earlier version) MFC after: 45 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19908
# 64f8d257	02-May-2019	Doug Moore <dougm@FreeBSD.org>	fls() should find the most significant bit of an int faster than a linear search can, so use it to avoid a linear search in isqrt. Approved by: kib (mentor), markj (mentor) Differential Revision: https://reviews.freebsd.org/D20102
# 46e39081	21-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Clear pointers to indicate that the respective locks are released. This fixes a problem in r344231: vm_pageout_launder() may scan two queues when swap is disabled. Reported by: pho MFC with: r344231
# 60256604	17-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Remove a redundant flag variable. Use the object pointer itself to determine whether the object is locked. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19215
# 8002c3a4	05-Nov-2018	Mark Johnston <markj@FreeBSD.org>	Initialize last_target in the laundry thread control loop. In practice it is always initialized because nfreed must be positive in order to trigger background laundering, but this isn't obvious. CID: 1387997 MFC after: 1 week
# 920239ef	30-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Fix some problems that manifest when NUMA domain 0 is empty. - In uma_prealloc(), we need to check for an empty domain before the first allocation attempt, not after. Fix this by switching uma_prealloc() to use a vm_domainset iterator, which addresses the secondary issue of using a signed domain identifier in round-robin iteration. - Don't automatically create a page daemon for domain 0. - In domainset_empty_vm(), recompute ds_cnt and ds_order after excluding empty domains; otherwise we may frequently specify an empty domain when calling in to the page allocator, wasting CPU time. Convert DOMAINSET_PREF() policies for empty domains to round-robin. - When freeing bootstrap pages, don't count them towards the per-domain total page counts for now: some vm_phys segments are created before the SRAT is parsed and are thus always identified as being in domain 0 even when they are not. Then, when bootstrap pages are freed, they are added to a domain that we had previously thought was empty. Until this is corrected, we simply exclude them from the per-domain page count. Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com> Reviewed by: gallatin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17704
# 30c5525b	01-Oct-2018	Andrew Gallatin <gallatin@FreeBSD.org>	Allow empty NUMA memory domains to support Threadripper2 The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has only 2 memory controllers enabled. This means that only 2 of the 4 NUMA domains can be populated with physical memory, and the others are empty. Add support to FreeBSD for empty NUMA domains by: - creating empty memory domains when parsing the SRAT table, rather than failing to parse the table - not running the pageout deamon threads in empty domains - adding defensive code to UMA to avoid allocating from empty domains - adding defensive code to cpuset to avoid binding to an empty domain Thanks to Jeff for suggesting this strategy. Reviewed by: alc, markj Approved by: re (gjb@) Differential Revision: https://reviews.freebsd.org/D1683
# 899fe184	23-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Add a per-pagequeue pdpages counter. Expose these counters under the vm.domain sysctl node. The existing vm.stats.vm.v_pdpages sysctl is preserved. Reviewed by: alc (previous version) Differential Revision: https://reviews.freebsd.org/D14666
# b50a4ea6	09-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Account for the lowmem handlers in the inactive queue scan target. Before r329882 the target would be computed after lowmem handlers run and free pages. On some systems a significant amount of page reclamation happens this way. However, with r329882 the target is computed first, which can lead to unnecessary reclamation from the page cache, and this in turn may result in excessive swapping. Instead, adjust the target after running lowmem handlers. Don't invoke the lowmem handlers before the PID controller, though, since that would hide the true rate of page allocation. Reviewed by: alc, kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16606
# d7aeb429	15-Jul-2018	Alan Cox <alc@FreeBSD.org>	Test PGA_REFERENCED after calling pmap_ts_referenced(), rather than before, so that a reference from a concurrently destroyed mapping is observed during the current scan. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16277
# 27e29d10	04-Jun-2018	Mark Johnston <markj@FreeBSD.org>	Correct the description of vm_pageout_scan_inactive() after r334508. Reported by: alc
# 49a3710c	01-Jun-2018	Mark Johnston <markj@FreeBSD.org>	Remove the "pass" variable from the page daemon control loop. It serves little purpose after r308474 and r329882. As a side effect, the removal fixes a bug in r329882 which caused the page daemon to periodically invoke lowmem handlers even in the absence of memory pressure. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D15491
# 7bb4634e	24-May-2018	Mark Johnston <markj@FreeBSD.org>	Update r334154 with review feedback from D15490. An old revision was committed by accident. Differential Revision: https://reviews.freebsd.org/D15490
# be37ee79	24-May-2018	Mark Johnston <markj@FreeBSD.org>	Split the active and inactive queue scans into separate subroutines. The scans are largely independent, so this helps make the code marginally neater, and makes it easier to incorporate feedback from the active queue scan into the page daemon control loop. Improve some comments while here. No functional change intended. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D15490
# 01f04471	18-May-2018	Mark Johnston <markj@FreeBSD.org>	Don't increment addl_page_shortage for wired pages. Such pages are dequeued as they're encountered during the inactive queue scan, so by the time we get to the active queue scan, they should have already been subtracted from the inactive queue length. Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15479
# 36f8fe9b	13-May-2018	Mark Johnston <markj@FreeBSD.org>	Get rid of vm_pageout_page_queued(). vm_page_queue(), added in r333256, generalizes vm_pageout_page_queued(), so use it instead. No functional change intended. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D15402
# 1b5c869d	04-May-2018	Mark Johnston <markj@FreeBSD.org>	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280
# 5cd29d0f	24-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893
# 64b38930	19-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Initialize marker pages in vm_page_domain_init(). They were previously initialized by the corresponding page daemon threads, but for vmd_inacthead this may be too late if vm_page_deactivate_noreuse() is called during boot. Reported and tested by: cperciva Reviewed by: alc, kib MFC after: 1 week
# c098768e	02-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Ensure the background laundering threshold is positive after a scan. The division added in r331732 meant that we wouldn't attempt a background laundering until at least v_free_target - v_free_min clean pages had been freed by the page daemon since the last laundering. If the inactive queue is depleted but not completely empty (e.g., because it contains busy pages), it can thus take a long time to meet this threshold. Restore the pre-r331732 behaviour of using a non-zero background laundering threshold if at least one inactive queue scan has elapsed since the last attempt at background laundering. Submitted by: tijl (original version)
# 60684862	29-Mar-2018	Mark Johnston <markj@FreeBSD.org>	Fix the background laundering mechanism after r329882. Rather than using the number of inactive queue scans as a metric for how many clean pages are being freed by the page daemon, have the page daemon keep a running counter of the number of pages it has freed, and have the laundry thread use that when computing the background laundering threshold. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14884
# 30fbfdda	15-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Eliminate pageout wakeup races. Take another step towards lockless vmd_free_count manipulation. Reduce the scope of the free lock by using a pageout lock to synchronize sleep and wakeup. Only trigger the pageout daemon on transitions between states. Drive all wakeup operations directly as side-effects from freeing memory rather than requiring an additional function call. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14612
# 3b8cf4ac	27-Feb-2018	Mark Johnston <markj@FreeBSD.org>	Give the 0th domain's page daemon thread a consistent name. Page daemon threads for other domains show up in ps(1) output as "pagedaemon/domN", so let that be the case for domain 0 as well. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14518
# 59d3150b	24-Feb-2018	Mark Johnston <markj@FreeBSD.org>	Restore the pre-r329882 inactive page shortage computation. With r329882, in the absence of a free page shortage we would only take len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to aggressively scan PQ_ACTIVE. Previously we would also include the number of free pages in this computation, ensuring that we wouldn't scan PQ_ACTIVE with plenty of free memory available. The change in behaviour was most noticeable immediately after booting, when PQ_INACTIVE and PQ_LAUNDRY are nearly empty. Reviewed by: jeff
# 5f8cd1c0	23-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402
# 2c0f13aa	20-Feb-2018	Konstantin Belousov <kib@FreeBSD.org>	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384
# 1d3a1bcf	07-Feb-2018	Mark Johnston <markj@FreeBSD.org>	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943
# e2068d0b	06-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000
# b6715dab	13-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb
# 5c515efc	29-Dec-2017	Alan Cox <alc@FreeBSD.org>	After r327168, the variable "vm_pageout_wanted" can be static. MFC after: 2 weeks
# 65ef3231	26-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Ensure that pass > 0 when starting a scan with vm_pages_needed == 1. Otherwise the page daemon will not reclaim pages and thus will not wake threads sleeping in VM_WAIT. Reported and tested by: pho Reviewed by: alc, kib X-MFC with: r327168 Differential Revision: https://reviews.freebsd.org/D13640
# 280d15cd	24-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Fix two problems with the page daemon control loop. Both issues caused the page daemon to erroneously go to sleep when applications are consuming free pages at a high rate, leaving the application threads blocked in VM_WAIT. 1) After completing an inactive queue scan, concurrent allocations may have prevented the page daemon from meeting the v_free_min threshold. In this case, the page daemon was going to sleep even when the inactive queue contained plenty of clean pages. 2) pagedaemon_wakeup() may be called without the free queues lock held. This can lead to a lost wakeup if a call occurs after the page daemon clears vm_pageout_wanted but before going to sleep. Fix 1) by ensuring that we start a new inactive queue scan immediately if v_free_count < v_free_min after a prior scan. Fix 2) by adding a new subroutine, pagedaemon_wait(), called from vm_wait() and vm_waitpfault(). It wakes up the page daemon if either vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps on v_free_count. Reported by: jeff Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13424
# cb35676e	11-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Use a dedicated counter for inactive queue scans. The laundry thread keeps track of the number of inactive queue scans performed by the page daemon, and was previously using the v_pdwakeups counter to count them. However, in some cases the inactive queue may be scanned multiple times after a single wakeup, so it's more accurate to use a dedicated counter. Reviewed by: alc, kib (previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13422
# 82e2d06a	09-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Fix the act_scan_laundry_weight mechanism. r292392 modified the active queue scan to weigh clean pages differently from dirty pages when attempting to meet the inactive queue target. When r306706 was merged into the PQ_LAUNDRY branch, this mechanism was broken. Fix it by scalaing the correct page shortage variable. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13423
# 6eebec83	06-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Use unique wait messages in the page daemon control loop. Discussed with: alc MFC after: 1 week
# 796df753	30-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.
# 57cd81a3	29-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Verify the object/vnode association after vget() in vm_pageout_clean(). It's theoretically possible for the vnode and object to be disassociated while locks are dropped around the vget() call, in which case we shouldn't proceed with laundering. Noted and reviewed by: kib MFC after: 1 week
# df57947f	18-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133
# e0b2fc3a	08-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Allow various page daemon parameters to be set from loader.conf. MFC after: 1 week
# ac04195b	20-Oct-2017	Konstantin Belousov <kib@FreeBSD.org>	Move swapout code into vm/vm_swapout.c. There is no NO_SWAPPING #ifdef left in the code. Requested by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12663
# aed9aaaa	28-Aug-2017	Mark Johnston <markj@FreeBSD.org>	Synchronize page laundering with pmap_extract_and_hold(). Before r207410, the hold count of a page in a page queue was protected by the queue lock, and, before laundering a page, the page daemon removed managed writeable mappings of the page before releasing the queue lock. This ensured that other threads could not concurrently create transient writeable mappings using pmap_extract_and_hold() on a user map, as is done for example by vmapbuf(). With that revision, however, a race can allow the creation of such a mapping, meaning that the page might be modified as it is being laundered, potentially resulting in it being marked clean when its contents do not match those given to the pager. Close the race by using the page lock to synchronize the hold count check in vm_pageout_cluster() with the removal of writeable managed mappings. Reported by: alc Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12084
# e2241590	24-Jun-2017	Alan Cox <alc@FreeBSD.org>	Increase the pageout cluster size to 32 pages. Decouple the pageout cluster size from the size of the hash table entry used by the swap pager for mapping (object, pindex) to a block on the swap device(s), and keep the size of a hash table entry at its current size. Eliminate a pointless macro. Reviewed by: kib, markj (an earlier version) MFC after: 4 weeks Differential Revision: https://reviews.freebsd.org/D11305
# 3e78e983	05-Jun-2017	Alan Cox <alc@FreeBSD.org>	The variable "breakout" is used like a Boolean, so actually define it as one. Reviewed by: kib MFC after: 5 days
# 83c9dea1	17-Apr-2017	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156
# 9b43bc27	25-Feb-2017	Andriy Gapon <avg@FreeBSD.org>	call vm_lowmem hook in uma_reclaim_worker A comment near kmem_reclaim() implies that we already did that. Calling the hook is useful, because some handlers, e.g. ARC, might be able to release significant amounts of KVA. Now that we have more than one place where vm_lowmem hook is called, use this change as an opportunity to introduce flags that describe a reason for calling the hook. No handler makes use of the flags yet. Reviewed by: markj, kib MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9764
# 937c1b07	14-Feb-2017	Andriy Gapon <avg@FreeBSD.org>	try to fix RACCT_RSS accounting There could be a race between the vm daemon setting RACCT_RSS based on the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to zero. In that case we can get a zombie process with non-zero RACCT_RSS. If the process is jailed, that may break accounting for the jail. There could be other consequences. Fix this race in the vm daemon by updating RACCT_RSS only when a process is in the normal state. Also, make accounting a little bit more accurate by refreshing the page resident count after calling vm_pageout_map_deactivate_pages(). Finally, add an assert that the RSS is zero when a process is reaped. PR: 210315 Reviewed by: trasz Differential Revision: https://reviews.freebsd.org/D9464
# b1fd102e	02-Jan-2017	Mark Johnston <markj@FreeBSD.org>	Add a page queue for holding dirty anonymous unswappable pages. On systems without a configured swap device, an attempt to launder pages from a swap object will always fail and result in the page being reactivated. This means that the page daemon will continuously scan pages that can never be evicted. With this change, anonymous pages are instead moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device is configured, so unreferenced unswappable pages are excluded from the page daemon's workload. Reviewed by: alc
# 99e6e193	23-Nov-2016	Mark Johnston <markj@FreeBSD.org>	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589
# ebcddc72	09-Nov-2016	Alan Cox <alc@FreeBSD.org>	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302
# 28323add	08-Nov-2016	Bryan Drewery <bdrewery@FreeBSD.org>	Fix improper use of "its". Sponsored by: Dell EMC Isilon
# 70cf3ced	05-Oct-2016	Alan Cox <alc@FreeBSD.org>	Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure. Reviewed by: kib MFC after: 3 weeks
# e57dd910	05-Oct-2016	Alan Cox <alc@FreeBSD.org>	Change vm_pageout_scan() to return a value indicating whether the free page target was met. Previously, vm_pageout_worker() itself checked the length of the free page queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan freed enough pages to meet the free page target. Specifically, vm_pageout_worker() used vm_paging_needed(). The trouble with vm_paging_needed() is that it compares the length of the free page queues to the wakeup threshold for the page daemon, which is much lower than the free page target. Consequently, vm_pageout_worker() could conclude that the inactive queue scan succeeded in meeting its free page target when in fact it did not; and rather than immediately triggering an all-out laundering pass over the inactive queue, vm_pageout_worker() would go back to sleep waiting for the free page count to fall below the page daemon wakeup threshold again, at which point it will perform another limited (pass == 1) scan over the inactive queue. Changing vm_pageout_worker() to use vm_page_count_target() instead of vm_paging_needed() won't work because any page allocations that happen concurrently with the inactive queue scan will result in the free page count being below the target at the end of a successful scan. Instead, having vm_pageout_scan() return a value indicating success or failure is the most straightforward fix. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8111
# 79144408	11-Aug-2016	Alan Cox <alc@FreeBSD.org>	Correct errors and clean up the comments on the active queue scan. Eliminate some unnecessary blank lines. Reviewed by: kib, markj MFC after: 1 week
# f0edf3f8	05-Aug-2016	Alan Cox <alc@FreeBSD.org>	Correct a spelling error.
# 248fe642	04-Aug-2016	Alan Cox <alc@FreeBSD.org>	Clean up the comments and code style in and around vm_pageout_cluster(). In particular, fix factual, grammatical, and spelling errors in various comments, and remove comments that are out of place in this function. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7410
# 87ff568c	01-Aug-2016	Alan Cox <alc@FreeBSD.org>	Restore the historical behavior of "sysctl vm.swap_idle_enabled=1". Prior to r254304, we had separate functions for reclamation and laundering (vm_pageout_scan) versus updating usage information, i.e., "reference bits", on active pages (vm_pageout_page_stats), and we only performed vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true. However, since r254303, if vm_swap_idle_enabled was "1", we have performed vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free pages. This was unintended and too aggressive, so I suspect no one uses this feature. With this change, we restore the historical behavior and only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free pages. Reviewed by: kib, markj
# 793172ea	29-Jul-2016	Alan Cox <alc@FreeBSD.org>	Remove a probe declaration that has been unused since r292469, when vm_pageout_grow_cache() was replaced. MFC after: 3 days
# f095d1bb	28-Jul-2016	Alan Cox <alc@FreeBSD.org>	Remove any mention of cache (PG_CACHE) pages from the comments in vm_pageout_scan(). That function has not cached pages since r284376. MFC after: 3 days
# 3ac8f842	27-Jul-2016	Mark Johnston <markj@FreeBSD.org>	De-pluralize "queues" where appropriate in the pagedaemon code. MFC after: 1 week
# a766ffd0	26-Jul-2016	Alan Cox <alc@FreeBSD.org>	Update a comment to reflect r284376. MFC after: 3 days
# 44be0a8e	23-Jul-2016	Mark Johnston <markj@FreeBSD.org>	Correct a comment - each page queue has its own lock. Reviewed by: alc MFC after: 3 days
# 20c58db9	19-Jul-2016	Mark Johnston <markj@FreeBSD.org>	Make vm_pageout_wakeup_thresh a u_int rather than an int. It's a threshold for v_free_count, which is of type u_int. This also lets us get rid of a cast in vm_paging_needed(). Reviewed by: alc MFC after: 1 week
# 95e2409a	22-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix a LOR between vnode locks and allproc_lock. There is an order between covered vnode lock and allproc_lock, which is established by calling mountcheckdirs() while owning the covered vnode lock. mountcheckdirs() iterates over the processes, protected by allproc_lock. This order is needed and seems to be not avoidable. On the other hand, various VM daemons also need to iterate over all processes, and they lock and unlock user maps. Since unlock of the user map may trigger processing of the deferred map entries, it causes vnode locking to occur. Or, when vmspace is freed, dropping references on the vnode-backed object also lock vnodes. We get reverted order comparing with the mount/unmount order. For VM daemons, there is no need to own allproc_lock while we operate on vmspaces. If the process is held, it serves as the marker for allproc list, which allows to continue the iteration. Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in of the kernel stacks. It is used instead of _PHOLD() in vm code, since e.g. calling faultin() in OOM conditions only exaggerates the problem. Modernize comment describing PHOLD. Reported by: lists@yamagi.org Tested by: pho (previous version) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 week Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6679
# 56ce0690	27-May-2016	Alan Cox <alc@FreeBSD.org>	The flag "vm_pages_needed" has long served two distinct purposes: (1) to indicate that threads are waiting for free pages to become available and (2) to indicate whether a wakeup call has been sent to the page daemon. The trouble is that a single flag cannot really serve both purposes, because we have two distinct targets for when to wakeup threads waiting for free pages versus when the page daemon has completed its work. In particular, the flag will be cleared by vm_page_free() before the page daemon has met its target, and this can lead to the OOM killer being invoked prematurely. To address this problem, a new flag "vm_pageout_wanted" is introduced. Discussed with: jeff Reviewed by: kib, markj Tested by: markj Sponsored by: EMC / Isilon Storage Division
# 763df3ec	02-May-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/vm: minor spelling fixes in comments. No functional change.
# 62d70a81	09-Apr-2016	John Baldwin <jhb@FreeBSD.org>	Add more fine-grained kernel options for NUMA support. VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in the virtual memory system. DEVICE_NUMA is used to enable affinity reporting for devices such as bus_get_domain(). MAXMEMDOM must still be set to a value greater than for any NUMA support to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is enabled and the system supports NUMA. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5782
# c869e672	19-Dec-2015	Alan Cox <alc@FreeBSD.org>	Introduce a new mechanism for relocating virtual pages to a new physical address and use this mechanism when: 1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical memory allocator's free page lists. This replaces the long-standing approach of scanning the inactive and inactive queues, converting clean pages into PG_CACHED pages and laundering dirty pages. In contrast, the new mechanism does not use PG_CACHED pages nor does it trigger a large number of I/O operations. 2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find free pages in the physical memory allocator's free page lists that are covered by the direct map. Tested by: adrian 3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable free pages in the physical memory allocator's free page lists. In the coming months, I expect that this new mechanism will be applied in other places. For example, balloon drivers should use relocation to minimize fragmentation of the guest physical address space. Make vm_phys_alloc_contig() a little smarter (and more efficient in some cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free page lists that can't possibly contain suitable pages. Reviewed by: kib, markj Glanced at: jhb Discussed with: jeff Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4444
# 2eb2f0d5	27-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	In vm_pageout_grow_cache(), do not re-try the inactive queue when active queue scan initiated write. Re-trying from the inactive queue when doing active scan makes the loop never end if number of domains is greater than 1 and inactive or active scan cannot reach the target. Reported and tested by: Andrew Gallatin <gallatin@netflix.com> Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 7672ca05	21-Nov-2015	Mark Johnston <markj@FreeBSD.org>	Remove unneeded includes of opt_kdtrace.h. As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h is not needed when defining SDT(9) probes.
# 76386c7e	15-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	Rework the test which raises OOM condition. Right now, the code checks for the swap space consumption plus checks that the amount of the free pages exceeds some limit, in case pagedeamon did not coped with the page shortage in one of the late passes. This is wrong because it does not account for the presence of the reclamaible pages in the queues which are not selectable for reclaim immediately. E.g., on the swap-less systems, large active queue easily triggered OOM. Instead, only raise OOM when pagedaemon is unable to produce a free page in several back-to-back passes. Track the failed passes per pagedaemon thread. The number of passes to trigger OOM was selected empirically and tested both on small (32M-64M i386 VM) and large (32G amd64) configurations. If the specifics of the load require tuning, sysctl vm.pageout_oom_seq sets the number of back-to-back passes which must fail before OOM is raised. Each pass takes 1/2 of seconds. Less the value, more sensible the pagedaemon is to the page shortage. In future, some heuristic to calculate the value of the tunable might be designed based on the system configuration and load. But before it can be done, the i/o system must be fixed to reliably time-out pagedaemon writes, even if waiting for the memory to proceed. Then, code can account for the in-flight page-outs and postpone OOM until all of them finished, which should reduce the need in tuning. Right now, ignoring the in-flight writes and the counter allows to break deadlocks due to write path doing sleepable memory allocations. Reported by: Dmitry Sivachenko, bde, many others Tested by: pho, bde, tuexen (arm) Reviewed by: alc Discussed with: bde, imp Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# 3949873f	15-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	Do not use vmspace_resident_count() for the OOM process selection. Residency count track the number of pte entries installed into the current pmap, which does not reflect the consumption of the physical memory by the address map. Due to several mechanisms like pv entries reclamation, copy on write etc. the resident pte entries count may be much less than the amount of physical memory kept by the process. Provide the OOM-specific vm_pageout_oom_pagecount() function which estimates the amount of reclamaible memory which could be stolen if the process is killed. Reported and tested by: pho Reviewed by: alc Comments text by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# b98acc0a	15-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	VM daemon works in parallel with the pagedaemon threads, and, among other actions, swaps out kernel stacks of the processes. On the other hand, currentl OOM logic which selects a process to kill in the critical condition, skips process with swapped-out thread. Under some loads, this results in the big(gest) process being ignored by OOM. Do not skip a process which has inhibited thread due to the swap-out, in the OOM selection loop. Note that killing such process requires the thread stack page-in, but sometimes this is the only way to recover. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# 7e78597f	07-Nov-2015	Mark Johnston <markj@FreeBSD.org>	Ensure that deactivated pages that are not expected to be reused are reclaimed in FIFO order by the pagedaemon. Previously we would enqueue such pages at the head of the inactive queue, yielding a LIFO reclaim order. Reviewed by: alc MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division
# 69b8585e	18-Oct-2015	Konstantin Belousov <kib@FreeBSD.org>	Only marker is guaranteed to be present on the queue after the relock in vm_pageout_fallback_object_lock() and vm_pageout_page_lock(). The check for the m->queue == queue assumes that the page does belong to a queue. Modify the 'unchanged' calculation bu dereferencing the marker tailq pointers, which is known to belong to the queue. Since for a page m linked to the queue, m->queue must be equal to the queue index, assert this instead of checking. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 2 weeks
# 8748f58c	15-Oct-2015	Konstantin Belousov <kib@FreeBSD.org>	Revert r289302, invalid pages can be queued, e.g. by vfs_vmio_unwire(). Found by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# 12a73f20	14-Oct-2015	Konstantin Belousov <kib@FreeBSD.org>	Invalid pages should not appear on the inactive queue. Change the check into an assertion. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# bc727596	03-Oct-2015	Alan Cox <alc@FreeBSD.org>	Reduce the scope of a variable to the only file where it is used.
# d9347bca	20-Sep-2015	Alan Cox <alc@FreeBSD.org>	Correct a non-fatal error in vm_pageout_worker(). vm_pageout_worker() should not assume that vm_pages_needed will remain set while it sleeps. Other threads can clear vm_pages_needed by performing a sufficient number of vm_page_free() calls, e.g., process termination. The effect of this error was that vm_pageout_worker() would free and/or launder pages when, in fact, there was no shortage of free pages. Rewrite a nearby comment to describe all of the possible cases and not just the most common case. The problem being that the comment made the most common case seem like the only case. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# 27a9fb2f	07-Sep-2015	Alan Cox <alc@FreeBSD.org>	To simplify upcoming changes to the inactive queue scan, change the code so that there is only one place where pages are freed and only one place where pages are moved to the tail of the queue. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# 960810cc	05-Sep-2015	Alan Cox <alc@FreeBSD.org>	Eliminate pointless requeueing of pages from terminated objects. These pages will have left the inactive queue before the page daemon performs its next scan. Also, ignore references to pages from terminated objects. This allows the clean pages to be freed a little sooner. Move some comments to their proper place, i.e., next to the code that they describe, and update other nearby comments. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# a3aeedab	01-Sep-2015	Alan Cox <alc@FreeBSD.org>	Handle held pages earlier in the inactive queue scan. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# 40aa80a7	27-Aug-2015	Alan Cox <alc@FreeBSD.org>	In vm_pageout_scan(), simplify the logic for determining if a page can be paged out and apply some nearby style fixes. In collaboration with: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation, EMC / Isilon Storage Division
# eb5d3969	24-Aug-2015	Alan Cox <alc@FreeBSD.org>	Testing whether a page is dirty does not require the page lock. Moreover, it may involve a pmap operation that iterates over the page's PV list, so unnecessarily holding the page lock is undesirable. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# a6bf3a9e	20-Aug-2015	Ryan Stone <rstone@FreeBSD.org>	Prevent ticks rollover from preventing vm_lowmem event Currently vm_pageout_scan() uses a ticks-based scheme to rate-limit the number of times that the vm_lowmem event will happen. However if no events happen for long enough for ticks to roll over, this leaves us in a long window in which vm_lowmem events will not happen. Replace the use of ticks with time_t to prevent rollover from ever being an issue. Reviewed by: ian MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3439
# f9b11500	16-Aug-2015	Alan Cox <alc@FreeBSD.org>	As another piece of PG_CACHE page elimination, remove an LRU-defeating call to vm_page_try_to_cache() from vm_pageout_flush(). Other changes, most recently r286814, have made this call unnecessary. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: EMC / Isilon Storage Division
# 22cf98d1	08-Jul-2015	Alan Cox <alc@FreeBSD.org>	The intention of r254304 was to scan the active queue continuously. However, I've observed the active queue scan stopping when there are frequent free page shortages and the inactive queue is steadily refilled by other mechanisms, such as the sequential access heuristic in vm_fault() or madvise(2). To remedy this problem, record the time of the last active queue scan, and always scan a number of pages proportional to the time since the last scan, regardless of whether that last scan was a timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass > 0") scan. Also, on a timeout-triggered scan, allow a full scan of the active queue when the system is short of inactive pages. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division
# aa044135	20-Jun-2015	Alan Cox <alc@FreeBSD.org>	Avoid pmap_is_modified() on pages that can't be mapped. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# 776f729c	14-Jun-2015	Konstantin Belousov <kib@FreeBSD.org>	Invalid pages do not need neither update of the activation count nor they coould be dirty. Move the handling if the invalid pages in the inactive scan earlier. Remove some code duplication in the scan by introducing the 'drop_page' label, which centralizes the object and the page unlock. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 78afdce6	13-Jun-2015	Alan Cox <alc@FreeBSD.org>	As the next step in eliminating PG_CACHE pages, free rather than cache pages in vm_pageout_scan(). The reactivation rate of cache pages created by vm_pageout_scan() is extremely low; typically no more than 0.5% to 2.25% of the pages are ever reactivated. At the same time, caching pages is more expensive than freeing them. For example, in a test with PostgreSQL, this change reduced the amount of time spent in the inactive queue scan by 1/6. Differential Revision: https://reviews.freebsd.org/D2805 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# f6f6d240	10-Jun-2015	Mateusz Guzik <mjg@FreeBSD.org>	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thred's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.
# 44ec2b63	09-May-2015	Konstantin Belousov <kib@FreeBSD.org>	The vmem callback to reclaim kmem arena address space on low or fragmented conditions currently just wakes up the pagedaemon. The kmem arena is significantly smaller then the total available physical memory, which means that there are loads where kmem arena space could be exhausted, while there is a lot of pages available still. The woken up pagedaemon sees vm_pages_needed != 0, verifies the condition vm_paging_needed() which is false, clears the pass and returns back to sleep, not calling neither uma_reclaim() nor lowmem handler. To handle low kmem arena conditions, create additional pagedaemon thread which calls uma_reclaim() directly. The thread sleeps on the dedicated channel and kmem_reclaim() wakes the thread in addition to the pagedaemon. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 4b5c9cf6	29-Apr-2015	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation
# 34d8b7ea	06-Apr-2015	Jeff Roberson <jeff@FreeBSD.org>	- Simplify vm_pageout_scan() by introducing a new vm_pageout_clean() function that does the locking and validation associated with cleaning a page. This moves 150 lines of code into its own function. - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it really does; clustering nearby pages for pageout optimization. Reviewd by: alc, kib, kmacy Tested by: pho (earlier version) Sponsored by: EMC / Isilon
# b3de46ab	27-Mar-2015	Jeff Roberson <jeff@FreeBSD.org>	- Eliminate pagequeue locking in the dirty code in vm_pageout_scan(). - Use a more precise series of tests to see if the page changed while we were locking the vnode. Reviewed by: alc Sponsored by: EMC / Isilon
# 8311a2b8	24-Jan-2015	Will Andrews <will@FreeBSD.org>	Add vm.panic_on_oom sysctl, which enables those who would rather panic than kill a process, when the system runs out of memory. Defaults to off. Usually, this is most useful when the OOM condition is due to mismanagement of memory, on a system where the applications in question don't respond well to being killed. In theory, if the system is properly managed, it shouldn't be possible to hit this condition. If it does, the panic can be more desirable for some users (since it can be a good means of finding the root cause) rather than killing the largest process and continuing on its merry way. As kib@ mentions in the differential, there is also protect(1), which uses procctl(PROC_SPROTECT) to ensure that some processes are immune. However, a panic approach is still useful in some environments. This is primarily intended as a development/debugging tool. Differential Revision: D1627 Reviewed by: kib MFC after: 1 week
# 71943c3d	24-Jan-2015	Konstantin Belousov <kib@FreeBSD.org>	Avoid calling vmspace_free() while owning the process lock. Freeing of an vm space may require obtaining sleepable locks. Hold the process to keep the pointer valid, and change trylock to lock, since there is no longer two process locks owned simultaneously in vm_pageout_oom(). Note that after the process lock is dropped, process might exec, and no longer qualify as the owner of biggest vm space. In collaboration with: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 14a0d74e	03-Oct-2014	Steven Hartland <smh@FreeBSD.org>	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. The restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte an page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay
# f721133e	24-Sep-2014	Steven Hartland <smh@FreeBSD.org>	Fix ticks wrap issue of lowmem test in vm_pageout_scan Reviewed by: jhb (D818) MFC after: 3 days Sponsored by: Multiplay
# 4d19f4ad	28-Aug-2014	Steven Hartland <smh@FreeBSD.org>	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay
# 9452b5ed	26-Aug-2014	Alan Cox <alc@FreeBSD.org>	Back in the days when the kernel was single threaded, testing "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to. This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible. This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads. Reported by: Steve Polyack Tested by: pho, Steve Polyack (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# e7d939bd	06-Jul-2014	Marcel Moolenaar <marcel@FreeBSD.org>	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan
# 60196cda	05-May-2014	Alan Cox <alc@FreeBSD.org>	Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated. The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that. Reported by: avg MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# 44f1c916	22-Mar-2014	Bryan Drewery <bdrewery@FreeBSD.org>	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division
# ab46f63e	20-Jan-2014	John Baldwin <jhb@FreeBSD.org>	Fix a couple of typos.
# 86fa2471	18-Jan-2014	Alan Cox <alc@FreeBSD.org>	Style changes in vm_pageout_scan(): 1. Be consistent in the style of "act_delta" manipulations between the inactive and active queue scans. 2. Explicitly compare to zero. 3. The deactivation of a page is based is based on its recent history and not just the current call to vm_pageout_scan(). The variable "act_delta" represents the current state of the page, and not its history. Avoid possible confusion by not (ab)using "act_delta" for the making the deactivation decision. Submitted by: kib [1] Reviewed by: kib [2,3]
# 9099545a	12-Jan-2014	Alan Cox <alc@FreeBSD.org>	Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases. Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.) Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it. Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held. Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place. Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# 938b0f5b	25-Dec-2013	Marcel Moolenaar <marcel@FreeBSD.org>	For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is that we don't have a good way (yet) to iterate over the mapped pages by virtual address and simply try each page within the range. Given that we call pmap_remove() over the entire 2^63 bytes of address space, it takes a while for pmap_remove to have tried all 2^50 pages. By using pmap_remove_pages() we use the PV list to find all mappings. Change derived from a patch by: alc
# d395270d	25-Dec-2013	Dimitry Andric <dim@FreeBSD.org>	In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void , to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void ' [-Wnon-literal-null-conversion]"). MFC after: 3 days
# 1bd7d0b7	09-Nov-2013	Konstantin Belousov <kib@FreeBSD.org>	If filesystem declares that it supports shared locking for writes, use shared vnode lock for VOP_PUTPAGES() as well. The only such filesystem in the tree is ZFS, and it uses vnode_pager_generic_putpages(), which performs the pageout with VOP_WRITE(). Reviewed by: alc Discussed with: avg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 274132ac	21-Aug-2013	Jeff Roberson <jeff@FreeBSD.org>	- Eliminate the vm object lock from the active queue scan. It is not necessary since we do not free or cache the page from active anymore. Document the one possible race that is harmless. Sponsored by: EMC / Isilon Storage Division Discussed with: alc
# c9612b2d	19-Aug-2013	Jeff Roberson <jeff@FreeBSD.org>	- Increase the active lru refresh interval to 10 minutes. This has been shown to negatively impact some workloads and the goal is only to eliminate worst case behaviors for very long periods of paging inactivity. Eventually we should determine a more complex scaling factor for this feature. - Rate limit low memory callback handlers to limit thrashing. Set the default to 10 seconds. Sponsored by: EMC / Isilon Storage Division
# 949c9186	17-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove the arbitrary binding of the pagedaemon threads to the domains, update the comment accordingly and make it more precise. Requested and reviewed by: jeff (previous version)
# 114f62c6	15-Aug-2013	Jeff Roberson <jeff@FreeBSD.org>	- Fix bug in r254304. Use the ACTIVE pq count for the active list processing, not inactive. This was the result of a bad merge. Reported by: pho Sponsored by: EMC / Isilon Storage Division
# d9e23210	13-Aug-2013	Jeff Roberson <jeff@FreeBSD.org>	Improve pageout flow control to wakeup more frequently and do less work while maintaining better LRU of active pages. - Change v_free_target to include the quantity previously represented by v_cache_min so we don't need to add them together everywhere we use them. - Add a pageout_wakeup_thresh that sets the free page count trigger for waking the page daemon. Set this 10% above v_free_min so we wakeup before any phase transitions in vm users. - Adjust down v_free_target now that we're willing to accept more pagedaemon wakeups. This means we process fewer pages in one iteration as well, leading to shorter lock hold times and less overall disruption. - Eliminate vm_pageout_page_stats(). This was a minor variation on the PQ_ACTIVE segment of the normal pageout daemon. Instead we now process 1 / vm_pageout_update_period pages every second. This causes us to visit the whole active list every 60 seconds. Previously we would only maintain the active LRU when we were short on pages which would mean it could be woefully out of date. Reviewed by: alc (slight variant of this) Discussed with: alc, kib, jhb Sponsored by: EMC / Isilon Storage Division
# c325e866	10-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# c7aebda8	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 449c2e92	07-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Split the pagequeues per NUMA domains, and split pageademon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# bb7858ea	26-Jul-2013	Jeff Roberson <jeff@FreeBSD.org>	Improve page LRU quality and simplify the logic. - Don't short-circuit aging tests for unmapped objects. This biases against unmapped file pages and transient mappings. - Always honor PGA_REFERENCED. We can now use this after soft busying to lazily restart the LRU. - Don't transition directly from active to cached bypassing the inactive queue. This frees recently used data much too early. - Rename actcount to act_delta to be more consistent with use and meaning. Reviewed by: kib, alc Sponsored by: EMC / Isilon Storage Division
# 90776bd7	23-Jul-2013	Jeff Roberson <jeff@FreeBSD.org>	- Remove the long obsolete 'vm_pageout_algorithm' experiment. Discussed with: alc Sponsored by: EMC / Isilon Storage Division
# e23b0a19	03-Jun-2013	Alan Cox <alc@FreeBSD.org>	Relax the object locking in vm_pageout_map_deactivate_pages() and vm_pageout_object_deactivate_pages(). A read lock suffices. Sponsored by: EMC / Isilon Storage Division
# a15f7df5	08-Apr-2013	Attilio Rao <attilio@FreeBSD.org>	The per-page act_count can be made very-easily protected by the per-page lock rather than vm_object lock, without any further overhead. Make the formal switch. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 53636869	27-Jan-2013	Andrey Zonov <zont@FreeBSD.org>	- Add sysctls to show number of stats scans. MFC after: 2 weeks
# 4a365329	27-Jan-2013	Andrey Zonov <zont@FreeBSD.org>	- Style. MFC after: 2 weeks
# 28634820	08-Dec-2012	Alan Cox <alc@FreeBSD.org>	In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages. To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged. Reviewed and tested by: kib
# 8d220203	12-Nov-2012	Alan Cox <alc@FreeBSD.org>	Replace the single, global page queues lock with per-queue locks on the active and inactive paging queues. Reviewed by: kib
# 9fc4739d	01-Nov-2012	Alan Cox <alc@FreeBSD.org>	In general, we call pmap_remove_all() before calling vm_page_cache(). So, the call to pmap_remove_all() within vm_page_cache() is usually redundant. This change eliminates that call to pmap_remove_all() and introduces a call to pmap_remove_all() before vm_page_cache() in the one place where it didn't already exist. When iterating over a paging queue, if the object containing the current page has a zero reference count, then the page can't have any managed mappings. So, a call to pmap_remove_all() is pointless. Change a panic() call in vm_page_cache() to a KASSERT(). MFC after: 6 weeks
# a406d8c3	28-Oct-2012	Edward Tomasz Napierala <trasz@FreeBSD.org>	Remove useless check; vm_pindex_t is unsigned on all architectures. CID: 3701 Found with: Coverity Prevent
# 5050aa86	22-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 7ecfabc7	13-Oct-2012	Alan Cox <alc@FreeBSD.org>	Move vm_page_requeue() to the only file that uses it. MFC after: 3 weeks
# 9af47af6	13-Oct-2012	Alan Cox <alc@FreeBSD.org>	Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held. Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful. MFC after: 3 weeks
# 1f11f2bf	23-Sep-2012	Alan Cox <alc@FreeBSD.org>	Address a race condition that was introduced in r238212. Unless the page queues lock is acquired before the page lock is released, there is no guarantee that the page will still be in that same page queue when vm_page_requeue() is called. Reported by: pho In collaboration with: kib MFC after: 3 days
# 96240c89	14-Sep-2012	Eitan Adler <eadler@FreeBSD.org>	Correct double "the the" Approved by: cperciva MFC after: 3 days
# 18f55b01	06-Aug-2012	Alan Cox <alc@FreeBSD.org>	Never sleep on busy pages in vm_pageout_launder(), always skip them. Long ago, sleeping on busy pages in vm_pageout_launder() made sense. The call to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages blocked vm_pageout_launder() until the flush had completed. However, in CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was changed to request synchronous I/O, but the sleep on busy pages was not removed.
# 311e34e2	26-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Do not requeue held page or page for which locking failed, just leave them alone. Process the act_count updates for the held pages in the vm_pageout loop over the inactive queue, instead of refusing to do anything with such page. Clarify the intent of the addl_page_shortage counter and change its use for pages which are not processed in the loop according to the description. Reviewed by: alc MFC after: 2 weeks
# 2cba1ccd	23-Jul-2012	Alan Cox <alc@FreeBSD.org>	Addendum to r238604. If the inactive queue scan isn't restarted, then the variable "addl_page_shortage_init" isn't needed. X-MFC after: r238604
# d4961bcb	18-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Do not restart scan of the inactive queue when non-inactive page is found. Rather, we shall not find such pages on inactive queue at all. Requested and reviewed by: alc MFC after: 2 weeks
# 85eeca35	17-Jul-2012	Alan Cox <alc@FreeBSD.org>	Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache(). Reviewed by: kib
# a4156419	08-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Avoid vm page queues lock leak after r238212. Reported and tested by: Michael Butler <imb protected-networks net> Reviewed by: alc Pointy hat to: kib MFC after: 20 days
# 48cc2fc77	07-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Drop page queues mutex on each iteration of vm_pageout_scan over the inactive queue, unless busy page is found. Dropping the mutex often should allow the other lock acquires to proceed without waiting for whole inactive scan to finish. On machines with lot of physical memory scan often need to iterate a lot before it finishes or finds a page which requires laundring, causing high latency for other lock waiters. Suggested and reviewed by: alc MFC after: 3 weeks
# 5d10ef20	06-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Style. Reviewed by: alc (previous version) MFC after: 1 week
# 6031c68d	16-Jun-2012	Alan Cox <alc@FreeBSD.org>	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks
# 7900f95d	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Assert that fictitious or unmanaged pages do not appear on active/inactive lists. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# 126d6082	17-Mar-2012	Konstantin Belousov <kib@FreeBSD.org>	In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag if the filesystem performed short write and we are skipping the page due to this. Propogate write error from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode. While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean. PR: kern/165927 Reviewed by: alc MFC after: 2 weeks
# 8d01a3b2	16-Jan-2012	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	Revert r212360 now that PowerPC can handle large sparse arguments to pmap_remove() (changed in r228412). MFC after: 2 weeks
# 3407fefe	06-Sep-2011	Konstantin Belousov <kib@FreeBSD.org>	Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic flags field. Updates to the atomic flags are performed using the atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify afalgs. Document the changes to flags field to only require the page lock. Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced. Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)
# afcc55f3	06-Jul-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".
# a8229fa3	02-Jul-2011	Alan Cox <alc@FreeBSD.org>	Initialize marker pages as held rather than fictitious/wired. Marking the page as held is more useful as a safety precaution in case someone forgets to check for PG_MARKER. Reviewed by: kib
# f497cda2	06-Apr-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	In vm_daemon(), do not skip processes stopped with SIGSTOP.
# 099e7e95	06-Apr-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add RACCT_RSS. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
# 8e6fa660	24-Mar-2011	John Baldwin <jhb@FreeBSD.org>	Fix some locking nits with the p_state field of struct proc: - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week
# 3fccbe43	18-Mar-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	In vm_daemon(), when iterating over all processes in the system, skip those which are not yet fully initialized (i.e. ones with p_state == PRS_NEW). Without it, we could panic in _thread_lock_flags(). Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that require similar fix. Reported by: pho, keramida Discussed with: kib
# 4c6a2e7a	16-Jan-2011	Alan Cox <alc@FreeBSD.org>	Shift responsibility for synchronizing access to the page's act_count field to the object's lock. Reviewed by: kib@
# 17f6a17b	02-Jan-2011	Alan Cox <alc@FreeBSD.org>	Release the page lock early in vm_pageout_clean(). There is no reason to hold this lock until the end of the function. With the aforementioned change to vm_pageout_clean(), page locks don't need to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions. Reviewed by: kib
# 1e8a675c	18-Nov-2010	Konstantin Belousov <kib@FreeBSD.org>	vm_pageout_flush() might cache the pages that finished write to the backing storage. Such pages might be then reused, racing with the assert in vm_object_page_collect_flush() that verified that dirty pages from the run (most likely, pages with VM_PAGER_AGAIN status) are write-protected still. In fact, the page indexes for the pages that were removed from the object page list should be ignored by vm_object_page_clean(). Return the length of successfully written run from vm_pageout_flush(), that is, the count of pages between requested page and first page after requested with status VM_PAGER_AGAIN. Supply the requested page index in the array to vm_pageout_flush(). Use the returned run length to forward the index of next page to clean in vm_object_page_clean(). Reported by: avg Reviewed by: alc MFC after: 1 week
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 42768fec	09-Sep-2010	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	On architectures with non-tree-based page tables like PowerPC, every page in a range must be checked when calling pmap_remove(). Calling pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range of the map could result in attempting to demap an extraordinary number of pages (> 10^15), so iterate through each map entry and unmap each of them individually. MFC after: 6 weeks
# cdaba1f2	02-Jul-2010	Alan Cox <alc@FreeBSD.org>	Push down the acquisition of the page queues lock into vm_pageout_page_stats(). In particular, avoid acquiring the page queues lock unless iterating over the active queue.
# 9cf51988	02-Jul-2010	Alan Cox <alc@FreeBSD.org>	With the demise of page coloring, the page queue macros no longer serve any useful purpose. Eliminate them. Reviewed by: kib
# 95976f3f	30-Jun-2010	Alan Cox <alc@FreeBSD.org>	Simplify entry to vm_pageout_clean(). Expect the page to be locked. Previously, the caller unlocked the page, and vm_pageout_clean() immediately reacquired the page lock. Also, assert rather than test that the page is neither busy nor held. Since vm_pageout_clean() is called with the object and page locked, the page can't have changed state since the caller verified that the page is neither busy nor held.
# 91b4f427	21-Jun-2010	Alan Cox <alc@FreeBSD.org>	Introduce vm_page_next() and vm_page_prev(), and use them in vm_pageout_clean(). When iterating over a range of pages, these functions can be cheaper than vm_page_lookup() because their implementation takes advantage of the vm_object's memq being ordered. Reviewed by: kib@ MFC after: 3 weeks
# 9ee2165f	14-Jun-2010	Alan Cox <alc@FreeBSD.org>	Eliminate checks for a page having a NULL object in vm_pageout_scan() and vm_pageout_page_stats(). These checks were recently introduced by the first page locking commit, r207410, but they are not needed. At the same time, eliminate some redundant accesses to the page's object field. (These accesses should have neen eliminated by r207410.) Make the assertion in vm_page_flag_set() stricter. Specifically, only managed pages should have PG_WRITEABLE set. Add a comment documenting an assertion to vm_page_flag_clear(). It has long been the case that fictitious pages have their wire count permanently set to one. Add comments to vm_page_wire() and vm_page_unwire() documenting this. Add assertions to these functions as well. Update the comment describing vm_page_unwire(). Much of the old comment had little to do with vm_page_unwire(), but a lot to do with _vm_page_deactivate(). Move relevant parts of the old comment to _vm_page_deactivate(). Only pages that belong to an object can be paged out. Therefore, it is pointless for vm_page_unwire() to acquire the page queues lock and enqueue such pages in one of the paging queues. Generally speaking, such pages are immediately freed after the call to vm_page_unwire(). Previously, it was the call to vm_page_free() that reacquired the page queues lock and removed these pages from the paging queues. Now, we will never acquire the page queues lock for this case. (It is also worth noting that since both vm_page_unwire() and vm_page_free() occurred with the page locked, the page daemon never saw the page with its object field set to NULL.) Change the panic with vm_page_unwire() to provide a more precise message. Reviewed by: kib@
# ce186587	10-Jun-2010	Alan Cox <alc@FreeBSD.org>	Reduce the scope of the page queues lock and the number of PG_REFERENCED changes in vm_pageout_object_deactivate_pages(). Simplify this function's inner loop using TAILQ_FOREACH(), and shorten some of its overly long lines. Update a stale comment. Assert that PG_REFERENCED may be cleared only if the object containing the page is locked. Add a comment documenting this. Assert that a caller to vm_page_requeue() holds the page queues lock, and assert that the page is on a page queue. Push down the page queues lock into pmap_ts_referenced() and pmap_page_exists_quick(). (As of now, there are no longer any pmap functions that expect to be called with the page queues lock held.) Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever be passed an unmanaged page. Assert this rather than returning "0" and "FALSE" respectively. ARM: Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH(). Push down the page queues lock inside of pmap_clearbit(), simplifying pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write(). Additionally, this allows for avoiding the acquisition of the page queues lock in some cases. PowerPC/AIM: moea_page_exits_quick() and moea_page_wired_mappings() will never be called before pmap initialization is complete. Therefore, the check for moea_initialized can be eliminated. Push down the page queues lock inside of moea_clear_bit(), simplifying moea_clear_modify() and moea_clear_reference(). The last parameter to moea_clear_bit() is never used. Eliminate it. PowerPC/BookE: Simplify mmu_booke_page_exists_quick()'s control flow. Reviewed by: kib@
# 567e51e1	24-May-2010	Alan Cox <alc@FreeBSD.org>	Roughly half of a typical pmap_mincore() implementation is machine- independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore(). Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page. Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information. Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock. Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed(). Reviewed by: kib (an earlier version)
# 04b7b96b	13-May-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r206814 (by alc): Remove a nonsensical test from vm_pageout_clean(). A page can't be in the inactive queue and have a non-zero wire count.
# 3c4a2440	08-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and pmap_remove_write(). Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.) Switch to a per-processor counter for the total number of pages cached.
# af394cfa	07-May-2010	Jung-uk Kim <jkim@FreeBSD.org>	Fix a typo in the previous commit.
# af4b86b9	07-May-2010	Konstantin Belousov <kib@FreeBSD.org>	One more use for vm_pageout_init_marker(). Reviewed by: alc
# 8c616246	05-May-2010	Konstantin Belousov <kib@FreeBSD.org>	Add a helper function vm_pageout_page_lock(), similar to tegge' vm_pageout_fallback_object_lock(), to obtain the page lock while having page queue lock locked, and still maintain the page position in a queue. Use the helper to lock the page in the pageout daemon and contig launder iterators instead of skipping the page if its lock is contested. Skipping locked pages easily causes pagedaemon or launder to not make a progress with page cleaning. Proposed and reviewed by: alc
# 88389d61	03-May-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r206264: When OOM searches for a process to kill, ignore the processes already killed by OOM. When killed process waits for a page allocation, try to satisfy the request as fast as possible.
# 6c56db5c	02-May-2010	Alan Cox <alc@FreeBSD.org>	Eliminate an assignment that was made redundant by r207410.
# 447fe2a4	02-May-2010	Alan Cox <alc@FreeBSD.org>	Defer the acquisition of the page and page queues locks in vm_pageout_object_deactivate_pages().
# 7bec141b	30-Apr-2010	Kip Macy <kmacy@FreeBSD.org>	push up dropping of the page queue lock to avoid holding it in vm_pageout_flush
# e8f26319	30-Apr-2010	Kip Macy <kmacy@FreeBSD.org>	- don't check hold_count without the page lock held - don't leak the page lock if m->object is NULL (assuming that that check will in fact even be valid when m->object is protected by the page lock)
# 2965a453	29-Apr-2010	Kip Macy <kmacy@FreeBSD.org>	On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib
# 82bfb965	29-Apr-2010	Alan Cox <alc@FreeBSD.org>	Simplify the inner loop of vm_pageout_object_deactivate_pages(). Rather than checking each page for PG_UNMANAGED, check the vm object's type. Only OBJT_PHYS can have unmanaged pages. Eliminate a pointless counter. The vm object is locked, that lock is never released by the inner loop, and the set of pages contained by the vm object is not changed by the inner loop. Therefore, the counter serves no purpose.
# 5d4a7b79	19-Apr-2010	Alan Cox <alc@FreeBSD.org>	Eliminate an unnecessary call to pmap_remove_all(). If a page belongs to an object whose reference count is zero, then that page cannot possibly be mapped.
# b28889a2	18-Apr-2010	Alan Cox <alc@FreeBSD.org>	Remove a nonsensical test from vm_pageout_clean(). A page can't be in the inactive queue and have a non-zero wire count. Reviewed by: kib MFC after: 3 weeks
# 3f1c4c4f	06-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	When OOM searches for a process to kill, ignore the processes already killed by OOM. When killed process waits for a page allocation, try to satisfy the request as fast as possible. This removes the often encountered deadlock, where OOM continously selects the same victim process, that sleeps uninterruptibly waiting for a page. The killed process may still sleep if page cannot be obtained immediately, but testing has shown that system has much higher chance to survive in OOM situation with the patch. In collaboration with: pho Reviewed by: alc MFC after: 4 weeks
# 02d65815	07-Feb-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r202529: vunref() the vnode in vm object deallocation code for OBJT_VNODE appropriate number of times to prevent possible vnode reference leak.
# b9f180d1	17-Jan-2010	Konstantin Belousov <kib@FreeBSD.org>	When a vnode-backed vm object is referenced, it increments the vnode reference count, and decrements it on dereference. If referenced object is deallocated, object type is reset to OBJT_DEAD. Consequently, all vnode references that are owned by object references are never released. vunref() the vnode in vm object deallocation code for OBJT_VNODE appropriate number of times to prevent leak. Add an assertion to the vm_pageout() to make sure that we never get reference on the vnode but then do not execute code to release it. In collaboration with: pho Reviewed by: alc MFC after: 3 weeks
# 01381811	24-Jul-2009	John Baldwin <jhb@FreeBSD.org>	Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to a device pager (OBJT_DEVICE) object in that it uses fictitious pages to provide aliases to other memory addresses. The primary difference is that it uses an sglist(9) to determine the physical addresses for a given offset into the object instead of invoking the d_mmap() method in a device driver. Reviewed by: alc Approved by: re (kensmith) MFC after: 2 weeks
# 26f4eea5	23-Jun-2009	Alan Cox <alc@FreeBSD.org>	The bits set in a page's dirty mask are a subset of the bits set in its valid mask. Consequently, there is no need to perform a bit-wise and of the page's dirty and valid masks in order to determine which parts of a page are dirty and valid. Eliminate an unnecessary #include.
# b78ddb0b	28-May-2009	Alan Cox <alc@FreeBSD.org>	Revise vm_pageout_scan()'s handling of partially dirty pages. Specifically, rather than unconditionally making partially dirty pages fully dirty, only make partially dirty pages fully dirty if the pmap says that the page has been modified. (This change is also a small optimization. It eliminate an unnecessary call to pmap_is_modified() on pages that are mapped read only.) Suggested by: tegge
# 47916d0c	17-May-2009	Alan Cox <alc@FreeBSD.org>	Eliminate a pointless call to pmap_clear_reference() from vm_pageout_scan(). If the page belongs to an object with a reference count of zero, then it can't have any managed mappings on which to clear a reference bit.
# 7981aa24	28-Apr-2009	Konstantin Belousov <kib@FreeBSD.org>	Use the acquired reference to the vmspace instead of direct dereferencing of p->p_vmspace in a place where it was missed in r191277. Noted by: pluknet gmail com
# 6bed074c	19-Apr-2009	Konstantin Belousov <kib@FreeBSD.org>	In both pageout oom handler and vm_daemon, acquire the reference to the vmspace of the examined process instead of directly accessing its vmspace, that may change. Also, as an optimization, check for P_INEXEC flag before examining the process. Reported and tested by: pho (previous version) Reviewed by: alc MFC after: 3 week
# f4b0c119	19-Apr-2009	Alan Cox <alc@FreeBSD.org>	Calling pmap_clear_modify() after calling pmap_remove_write() is pointless. The latter function already clears the modified status from each of the page's mappings.
# 6129343d	16-Nov-2008	Konstantin Belousov <kib@FreeBSD.org>	Instead of forcing vn_start_write() to reset mp back to NULL for the failed calls with non-NULL vp, explicitely clear mp after failure. Tested by: stass Reviewed by: tegge PR: 123768 MFC after: 1 week
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 2025d69b	29-Sep-2008	Konstantin Belousov <kib@FreeBSD.org>	Move the code for doing out-of-memory grass from vm_pageout_scan() into the separate function vm_pageout_oom(). Supply a parameter for vm_pageout_oom() describing a reason for the call. Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone is exhausted. Reviewed by: alc Tested by: pho, jhb MFC after: 2 weeks
# 8d28bf04	21-Sep-2008	Alan Cox <alc@FreeBSD.org>	Prevent an integer overflow in vm_pageout_page_stats() on machines with a large number of physical pages. PR: 126158 Submitted by: Dmitry Tejblum MFC after: 3 days
# 6bd9cb1c	03-Aug-2008	Tom Rhodes <trhodes@FreeBSD.org>	Fill in a few sysctl descriptions. Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com> Approved by: alc
# e5b006ff	19-Mar-2008	Alan Cox <alc@FreeBSD.org>	Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent migration to vm/vm_page.c.
# 374ae2a3	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.
# 237fdd78	16-Mar-2008	Robert Watson <rwatson@FreeBSD.org>	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# da31e3aa	25-Nov-2007	Alan Cox <alc@FreeBSD.org>	Make contigmalloc(9)'s page laundering more robust. Specifically, use vm_pageout_fallback_object_lock() in vm_contig_launder_page() to better handle a lock-ordering problem. Consequently, trylock's failure on the page's containing object no longer implies that the page cannot be laundered. MFC after: 6 weeks
# 5dfc2870	22-Nov-2007	Alan Cox <alc@FreeBSD.org>	Add a read/write sysctl for reconfiguring the maximum number of physical pages that can be wired. Submitted by: Eugene Grosbein PR: 114654 MFC after: 6 weeks
# 7bfda801	25-Sep-2007	Alan Cox <alc@FreeBSD.org>	Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
# b61ce5b0	16-Sep-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in source tools to use the new and more correct location of P_INMEM. Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)
# 4cd45723	15-Sep-2007	Alan Cox <alc@FreeBSD.org>	Correct an assertion in vm_pageout_flush(). Specifically, if a page's status after vm_pager_put_pages() is VM_PAGER_PEND, then it could have already been recycled, i.e., freed and reallocated to a new purpose; thus, asserting that such pages cannot be written is inappropriate. Reported by: kris Submitted by: tegge Approved by: re (kensmith) MFC after: 1 week
# 14137dc0	02-Jul-2007	Alan Cox <alc@FreeBSD.org>	In the previous revision, when I replaced the unconditional acquisition of Giant in vm_pageout_scan() with VFS_LOCK_GIANT(), I had to eliminate the acquisition of the vnode interlock before releasing the vm object's lock because the vnode interlock cannot be held when VFS_LOCK_GIANT() is performed. Unfortunately, this allows the vnode to be recycled between the release of the vm object's lock and the vget() on the vnode. In this revision, I prevent the vnode from being recycled by acquiring another reference to the vm object and underlying vnode before releasing the vm object's lock. This change also addresses another preexisting but trivial problem. By acquiring another reference to the vm object, I also prevent the vm object from being recycled. Previously, the "vnodes skipped" counter could be wrong because if it examined a recycled vm object. Reported by: kib Reviewed by: kib Approved by: re (kensmith) MFC after: 3 weeks
# 97824da3	26-Jun-2007	Alan Cox <alc@FreeBSD.org>	Eliminate the use of Giant from vm_daemon(). Replace the unconditional use of Giant in vm_pageout_scan() with VFS_LOCK_GIANT(). Approved by: re (kensmith) MFC after: 3 weeks
# 9e897b1b	17-Jun-2007	Alan Cox <alc@FreeBSD.org>	Eliminate unnecessary checks from vm_pageout_clean(): The page that is passed to vm_pageout_clean() cannot possibly be PG_UNMANAGED because it came from the inactive queue and PG_UNMANAGED pages are not in any page queue. Moreover, PG_UNMANAGED pages only exist in OBJT_PHYS objects, and all pages within a OBJT_PHYS object are PG_UNMANAGED. So, if the page that is passed to vm_pageout_clean() is not PG_UNMANAGED, then it cannot be from an OBJT_PHYS object and its neighbors from the same object cannot themselves be PG_UNMANAGED. Reviewed by: tegge
# 2446e4f0	15-Jun-2007	Alan Cox <alc@FreeBSD.org>	Enable the new physical memory allocator. This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...). The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld. This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored. Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map. Approved by: re
# d076fbea	13-Jun-2007	Alan Cox <alc@FreeBSD.org>	Eliminate dead code: We have not performed pageouts on the kernel object in this millenium.
# 393a081d	10-Jun-2007	Attilio Rao <attilio@FreeBSD.org>	Optimize vmmeter locking. In particular: - Add an explicative table for locking of struct vmmeter members - Apply new rules for some of those members - Remove some unuseful comments Heavily reviewed by: alc, bde, jeff Approved by: jeff (mentor)
# 982d11f8	04-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling sychronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
# b4b70819	04-Jun-2007	Attilio Rao <attilio@FreeBSD.org>	Do proper "locking" for missing vmmeters part. Now, we assume no more sched_lock protection for some of them and use the distribuited loads method for vmmeter (distribuited through CPUs). Reviewed by: alc, bde Approved by: jeff (mentor)
# 2feb50bf	31-May-2007	Attilio Rao <attilio@FreeBSD.org>	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# 222d0195	18-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# e9f995d8	06-Feb-2007	Alan Cox <alc@FreeBSD.org>	Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the vm page queue free mutex instead of the vm page queue mutex.
# f67af5c9	17-Jan-2007	Xin LI <delphij@FreeBSD.org>	Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.
# 9af80719	21-Oct-2006	Alan Cox <alc@FreeBSD.org>	Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.
# 78985e42	01-Aug-2006	Alan Cox <alc@FreeBSD.org>	Complete the transition from pmap_page_protect() to pmap_remove_write(). Originally, I had adopted sparc64's name, pmap_clear_write(), for the function that is now pmap_remove_write(). However, this function is more like pmap_remove_all() than like pmap_clear_modify() or pmap_clear_reference(), hence, the name change. The higher-level rationale behind this change is described in src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm trying to clean up and fix our support for execute access. Reviewed by: marcel@ (ia64)
# 625e6c0a	17-Feb-2006	Tor Egge <tegge@FreeBSD.org>	Expand scope of marker to reduce the number of page queue scan restarts.
# db27dcc0	17-Feb-2006	Tor Egge <tegge@FreeBSD.org>	Check return value from nonblocking call to vn_start_write().
# 3b7db47d	04-Feb-2006	Alan Cox <alc@FreeBSD.org>	Remove an unnecessary call to pmap_remove_all(). The given page is not mapped because its contents are invalid.
# 997e1c25	27-Jan-2006	Alan Cox <alc@FreeBSD.org>	Use the new macros abstracting the page coloring/queues implementation. (There are no functional changes.)
# ef39c05b	31-Dec-2005	Alexander Leidinger <netchild@FreeBSD.org>	MI changes: - provide an interface (macros) to the page coloring part of the VM system, this allows to try different coloring algorithms without the need to touch every file [1] - make the page queue tuning values readable: sysctl vm.stats.pagequeue - autotuning of the page coloring values based upon the cache size instead of options in the kernel config (disabling of the page coloring as a kernel option is still possible) MD changes: - detection of the cache size: only IA32 and AMD64 (untested) contains cache size detection code, every other arch just comes with a dummy function (this results in the use of default values like it was the case without the autotuning of the page coloring) - print some more info on Intel CPU's (like we do on AMD and Transmeta CPU's) Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue" and report if the cache* values are zero (= bug in the cache detection code) or not. Based upon work by: Chad David <davidc@acns.ab.ca> [1] Reviewed by: alc, arch (in 2004) Discussed with: alc, Chad David, arch (in 2004)
# 7a35a21e	09-Nov-2005	Alan Cox <alc@FreeBSD.org>	Reimplement the reclamation of PV entries. Specifically, perform reclamation synchronously from get_pv_entry() instead of asynchronously as part of the page daemon. Additionally, limit the reclamation to inactive pages unless allocation from the PV entry zone or reclamation from the inactive queue fails. Previously, reclamation destroyed mappings to both inactive and active pages. get_pv_entry() still, however, wakes up the page daemon when reclamation occurs. The reason being that the page daemon may move some pages from the active queue to the inactive queue, making some new pages available to future reclamations. Print the "reclaiming PV entries" message at most once per minute, but don't stop printing it after the fifth time. This way, we do not give the impression that the problem has gone away. Reviewed by: tegge
# 8dbca793	09-Aug-2005	Tor Egge <tegge@FreeBSD.org>	Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE due to the vm object being locked. When a process writes large amounts of data to a file, the vm object associated with that file can contain most of the physical pages on the machine. If the process is preempted while holding the lock on the vm object, pagedaemon would be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to other vm objects. Temporarily unlock the page queues lock while locking vm objects to avoid lock order violation. Detect and handle relevant page queue changes. This change depends on both the lock portion of struct vm_object and normal struct vm_page being type stable. Reviewed by: alc
# 60727d8b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# df2e33bf	06-Jan-2005	Alan Cox <alc@FreeBSD.org>	Revise the part of vm_pageout_scan() that moves pages from the cache queue to the free queue. With this change, if a page from the cache queue belongs to a locked object, it is simply skipped over rather than moved to the inactive queue.
# 34d9e6fd	04-Nov-2004	Alan Cox <alc@FreeBSD.org>	During traversal of the inactive queue, try locking the page's containing object before accessing the page's flags or the object's reference count.
# d19ef814	03-Nov-2004	Alan Cox <alc@FreeBSD.org>	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
# b86e6ec0	30-Oct-2004	Alan Cox <alc@FreeBSD.org>	During traversal of the active queue by vm_pageout_page_stats(), try locking the page's containing object before accessing the page's flags.
# 4b8a5c40	30-Oct-2004	Alan Cox <alc@FreeBSD.org>	Add an assignment statement that I omitted from the previous revision.
# b08abf6c	27-Oct-2004	Alan Cox <alc@FreeBSD.org>	During traversal of the active queue, try locking the page's containing object before accessing the page's flags or the object's reference count. If the trylock fails, handle the page as though it is busy.
# 3e36afbe	17-Jul-2004	Alan Cox <alc@FreeBSD.org>	Remove the GIANT_REQUIRED preceding pmap_remove() in vm_pageout_map_deactivate_pages().
# 3d2e54c3	15-Jul-2004	Alan Cox <alc@FreeBSD.org>	Push down the acquisition and release of the page queues lock into pmap_protect() and pmap_remove(). In general, they require the lock in order to modify a page's pv list or flags. In some cases, however, pmap_protect() can avoid acquiring the lock.
# 26354d4c	12-Jul-2004	Alan Cox <alc@FreeBSD.org>	Remove an unused and unimplemented sysctl. (For the record, it was marked as unimplemented in revision 1.129 nearly six years ago.)
# 5e609009	23-Jun-2004	Alan Cox <alc@FreeBSD.org>	Call vm_pageout_page_stats() with the page queues lock held.
# 1aab16a6	23-Jun-2004	Alan Cox <alc@FreeBSD.org>	Remove spl calls.
# fa885116	15-Jun-2004	Julian Elischer <julian@FreeBSD.org>	Nice, is a property of a process as a whole.. I mistakenly moved it to the ksegroup when breaking up the process structure. Put it back in the proc structure.
# f651b129	11-May-2004	Alan Cox <alc@FreeBSD.org>	Cache queue pages are not mapped. Thus, the pmap_remove_all() by vm_pageout_scan()'s loop for freeing cache queue pages is unnecessary.
# dcbcd518	04-Mar-2004	Bruce Evans <bde@FreeBSD.org>	Minor style fixes. In vm_daemon(), don't fetch the rss limit long before it is needed.
# 9ea8d1a6	21-Feb-2004	Alan Cox <alc@FreeBSD.org>	Eliminate the second, unnecessary call to pmap_page_protect() near the end of vm_pageout_flush(). Instead, assert that the page is still write protected. Discussed with: tegge
# 84d98bf6	14-Feb-2004	Alan Cox <alc@FreeBSD.org>	- Correct a long-standing race condition in vm_page_try_to_cache() that could result in a panic "vm_page_cache: caching a dirty page, ...": Access to the page must be restricted or removed before calling vm_page_cache(). This race condition is identical in nature to that which was addressed by vm_pageout.c's revision 1.251. - Simplify the code surrounding the fix to this same race condition in vm_pageout.c's revision 1.251. There should be no behavioral change. Reviewed by: tegge MFC after: 7 days
# a3dfacb5	10-Feb-2004	Alan Cox <alc@FreeBSD.org>	Correct a long-standing race condition in the inactive queue scan. (See the added comment for low-level details.) The effect of this race condition is a panic "vm_page_cache: caching a dirty page, ..." Reviewed by: tegge MFC after: 7 days
# 91d5354a	04-Feb-2004	John Baldwin <jhb@FreeBSD.org>	Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that is is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't used the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists. Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64
# 2e3b314d	24-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing vm_pageout_page_stats() from Giant. - Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the vm object to be locked on entry. (All of the pager routines now expect this.)
# ab42316c	22-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from vm_pageout_scan(). Rationale: I don't like leaving a busy page in the cache queue with neither the vm object nor the vm page queues lock held. - Assert that the page is active in vm_pageout_page_stats().
# d3c09dd7	21-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Assert that every page found in the active queue is an active page.
# 7a935082	18-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Increase the object lock's scope in vm_contig_launder() so that access to the object's type field and the call to vm_pageout_flush() are synchronized. - The above change allows for the eliminaton of the last parameter to vm_pageout_flush(). - Synchronize access to the page's valid field in vm_pageout_flush() using the containing object's lock.
# 6989c456	16-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Synchronize access to a vm page's valid field using the containing vm object's lock. - Release the vm object and vm page queues locks around vput().
# 45ae1d91	18-Sep-2003	Alan Cox <alc@FreeBSD.org>	Merge vm_pageout_free_page_calc() into vm_pageout(), eliminating some unneeded code.
# 4b5f5531	17-Sep-2003	Alan Cox <alc@FreeBSD.org>	When calling vget() on a vnode-backed vm object, acquire the vnode interlock before releasing the vm object's lock.
# 3562af12	30-Aug-2003	Alan Cox <alc@FreeBSD.org>	- Add vm object locking to the part of vm_pageout_scan() that launders dirty pages. - Remove some unused variables.
# 3e1b578a	14-Aug-2003	Alan Cox <alc@FreeBSD.org>	Extend the scope of the page queues lock in vm_pageout_scan() to cover the traversal of the PQ_INACTIVE queue.
# 8f60c087	03-Aug-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Change the layout policy of the swap_pager from a hardcoded width striping to a per device round-robin algorithm. Because of the policy of not attempting to retain previous swap allocation on page-out, this means that a newly added swap device almost instantly takes its 1/N share of the I/O load but it takes somewhat longer for it to assume it's 1/N share of the pages if there is plenty of space on the other devices. Change the 8G total swapspace limitation to 8G per device instead by using a per device blist rather than one global blist. This reduces the memory footprint by 75% (typically a couple hundred kilobytes) for the common case with one swapdevice but NSWAPDEV=4. Remove the compile time constant limit of number of swap devices, there is no limit now. Instead of a fixed size array, store the per swapdev structure in a TAILQ. Total swap space is still addressed by a 32 bit page number and therefore the upper limit is now 2^42 bytes = 16TB (for i386). We still do not allocate the first page of each device in order to give some amount of protection to any bsdlabel at the start of the device. A new device is appended after the existing devices in the swap space, no attempt is made to fill in holes left behind by swapoff (this can trivially be changed should it ever become a problem). The sysctl vm.nswapdev now reflects the number of currently configured swap devices. Rename vm_swap_size to swap_pager_avail for consistency with other exported names. Change argument type for vm_proc_swapin_all() and swap_pager_isswapped() to be a struct swdevt pointer rather than an index. Not changed: we are still using blists to manage the free space, but since the swapspace is no longer fragmented by the striping different resource managers might fare better.
# ecf6279f	07-Jul-2003	Alan Cox <alc@FreeBSD.org>	- Complete the vm object locking in vm_pageout_object_deactivate_pages(). - Change vm_pageout_object_deactivate_pages()'s first parameter from a vm_map_t to a pmap_t. - Change vm_pageout_object_deactivate_pages()'s and vm_pageout_map_deactivate_pages()'s last parameter from a vm_pindex_t to a long. Since the number of pages in an address space doesn't require 64 bits on an i386, vm_pindex_t is overkill.
# 0774dfb3	29-Jun-2003	Alan Cox <alc@FreeBSD.org>	Add vm object locking to vm_pageout_map_deactivate_pages().
# 5163584c	28-Jun-2003	Alan Cox <alc@FreeBSD.org>	- Add vm object locking to vm_pageout_clean().
# 874651b1	11-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# e92686d0	18-May-2003	David Schultz <das@FreeBSD.org>	If we seem to be out of VM, don't allow the pagedaemon to kill processes in the first pass. Among other things, this will give us a chance to launder vnode-backed pages before concluding that we need more swap. This is particularly useful for systems that have no swap. While here, update a comment and remove some long-unused code. Reported by: Lucky Green <shamrock@cypherpunks.to> Suggested by: dillon Approved by: re (rwatson)
# c4a1d732	04-May-2003	Alan Cox <alc@FreeBSD.org>	Avoid a lock-order reversal and implement vm_object locking in vm_pageout_page_free().
# 85b1dc89	29-Apr-2003	Alan Cox <alc@FreeBSD.org>	Eliminate an unused parameter from vm_pageout_object_deactivate_pages().
# b6e48e03	23-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Acquire the vm_object's lock when performing vm_object_page_clean(). - Add a parameter to vm_pageout_flush() that tells vm_pageout_flush() whether its caller has locked the vm_object. (This is a temporary measure to bootstrap vm_object locking.)
# 897ecacd	22-Apr-2003	John Baldwin <jhb@FreeBSD.org>	Lock the proc to check p_flag and several other related tests in vm_daemon(). We don't need to hold sched_lock as long now as a result.
# 72ba747d	20-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Lock the vm_object when performing vm_object_pip_wakeup(). - Merge two identical cases in a switch statement.
# d22bc710	19-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Lock the vm_object when performing vm_object_pip_add().
# f4cf2141	31-Mar-2003	Wes Peters <wes@FreeBSD.org>	Add a facility allowing processes to inform the VM subsystem they are critical and should not be killed when pageout is looking for more memory pages in all the wrong places. Reviewed by: arch@ Sponsored by: St. Bernard Software
# 72d97679	12-Mar-2003	David Schultz <das@FreeBSD.org>	- When the VM daemon is out of swap space and looking for a process to kill, don't block on a map lock while holding the process lock. Instead, skip processes whose map locks are held and find something else to kill. - Add vm_map_trylock_read() to support the above. Reviewed by: alc, mike (mentor)
# 6b4b77ad	09-Feb-2003	Alan Cox <alc@FreeBSD.org>	Add a comment describing how pagedaemon_wakeup() should be used and synchronized. Suggested by: tegge
# a1c0a785	02-Feb-2003	Alan Cox <alc@FreeBSD.org>	- It's more accurate to say that vm_paging_needed() returns TRUE than a positive number. - In pagedaemon_wakeup(), set vm_pages_needed to 1 rather than incrementing it to accomplish the same.
# 8e1d8de5	01-Feb-2003	Alan Cox <alc@FreeBSD.org>	- Convert vm_pageout()'s tsleep()s to msleep()s with the page queue lock.
# 8b245767	01-Feb-2003	Alan Cox <alc@FreeBSD.org>	- Remove (some) unnecessary explicit initializations to zero. - Style changes to vm_pageout(): declarations and white-space.
# b0ef8c5f	13-Jan-2003	Alan Cox <alc@FreeBSD.org>	- Update vm_pageout_deficit using atomic operations. It's a simple counter outside the scope of existing locks. - Eliminate a redundant clearing of vm_pageout_deficit.
# ff2023a5	13-Jan-2003	Alan Cox <alc@FreeBSD.org>	Make vm_pageout_page_free() static.
# c410df59	03-Jan-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Avoid extern decls in .c files by putting them in the vm/swap_pager.h include file where they belong. Share the dmmax_mask variable.
# 40bb4f4b	28-Dec-2002	Matthew Dillon <dillon@FreeBSD.org>	vm_pager_put_pages() takes VM_PAGER_* flags, not OBJPC_* flags. It just so happens that OBJPC_SYNC has the same value as VM_PAGER_PUT_SYNC so no harm done. But fix it :-) No operational changes. MFC after: 1 day
# b365ea9e	17-Dec-2002	Alan Cox <alc@FreeBSD.org>	Hold the page queues lock when performing vm_page_flag_set().
# 38857e7f	30-Nov-2002	Alan Cox <alc@FreeBSD.org>	Hold the page queues lock when calling pmap_protect(); it updates fields of the vm_page structure. Nearby, remove an unnecessary semicolon and return statement. Approved by: re (blanket)
# 78f7187d	30-Nov-2002	Alan Cox <alc@FreeBSD.org>	Increase the scope of the page queue lock in vm_pageout_scan(). Approved by: re (blanket)
# ba0208b9	23-Nov-2002	Alan Cox <alc@FreeBSD.org>	Assert that the page queues lock rather than Giant is held in vm_pageout_page_free(). Approved by: re
# 855a310f	21-Nov-2002	Jeff Roberson <jeff@FreeBSD.org>	- Add an event that is triggered when the system is low on memory. This is intended to be used by significant memory consumers so that they may drain some of their caches. Inspired by: phk Approved by: re Tested on: x86, alpha
# a12cc0e4	17-Nov-2002	Alan Cox <alc@FreeBSD.org>	Remove vm_page_protect(). Instead, use pmap_page_protect() directly.
# 4fec79be	16-Nov-2002	Alan Cox <alc@FreeBSD.org>	Now that pmap_remove_all() is exported by our pmap implementations use it directly.
# eea85e9b	12-Nov-2002	Alan Cox <alc@FreeBSD.org>	Move pmap_collect() out of the machine-dependent code, rename it to reflect its new location, and add page queue and flag locking. Notes: (1) alpha, i386, and ia64 had identical implementations of pmap_collect() in terms of machine-independent interfaces; (2) sparc64 doesn't require it; (3) powerpc had it as a TODO.
# d154fb4f	10-Nov-2002	Alan Cox <alc@FreeBSD.org>	When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than indirectly through vm_page_protect(). The one remaining page flag that is updated by vm_page_protect() is already being updated by our various pmap implementations. Note: A later commit will similarly change the VM_PROT_READ case and eliminate vm_page_protect().
# b43179fb	11-Oct-2002	Jeff Roberson <jeff@FreeBSD.org>	- Create a new scheduler api that is defined in sys/sched.h - Begin moving scheduler specific functionality into sched_4bsd.c - Replace direct manipulation of scheduler data with hooks provided by the new api. - Remove KSE specific state modifications and single runq assumptions from kern_switch.c Reviewed by: -arch
# 3ef3e7c4	24-Sep-2002	Jeff Roberson <jeff@FreeBSD.org>	- Get rid of the unused LK_NOOBJ.
# 05ba50f5	21-Sep-2002	Jake Burkholder <jake@FreeBSD.org>	Use the fields in the sysentvec and in the vm map header in place of the constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS. This is mainly so that they can be variable even for the native abi, based on different machine types. Get stack protections from the sysentvec too. This makes it trivial to map the stack non-executable for certain abis, on machines that support it.
# 71fad9fd	11-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Completely redo thread states. Reviewed by: davidxu@freebsd.org
# 67ef391e	10-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_activate().
# 299018d3	27-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_free(). o Increment cnt.v_dfree inside vm_pageout_page_free() rather than at each call.
# 55df3298	27-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Require that the page queues lock is held on entry to vm_pageout_clean() and vm_pageout_flush(). o Acquire the page queues lock before calling vm_pageout_clean() or vm_pageout_flush().
# ce18aebd	27-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_activate() and vm_page_deactivate() in vm_pageout_object_deactivate_pages(). o Apply some style fixes to vm_pageout_object_deactivate_pages().
# 8ffc1519	22-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Extend the scope of the page queues lock in vm_pageout_scan() to cover the traversal of the cache queue.
# 40eab1e9	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_try_to_cache(). (The accesses in kern/vfs_bio.c are already locked.) o Assert that the page queues lock is held in vm_page_try_to_cache().
# 15a5d210	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_cache() in vm_fault() and vm_pageout_scan(). (The others are already locked.) o Assert that the page queues lock is held in vm_page_cache().
# 48c0444c	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock accesses to the active page queue in vm_pageout_scan() and vm_pageout_page_stats().
# e602ba25	29-Jun-2002	Julian Elischer <julian@FreeBSD.org>	Part 1 of KSE-III The ability to schedule multiple threads per process (one one cpu) by making ALL system calls optionally asynchronous. to come: ia64 and power-pc patches, patches for gdb, test program (in tools) Reviewed by: Almost everyone who counts (at various times, peter, jhb, matt, alfred, mini, bernd, and a cast of thousands) NOTE: this is still Beta code, and contains lots of debugging stuff. expect slight instability in signals..
# d974f03c	28-Apr-2002	Alan Cox <alc@FreeBSD.org>	o Introduce and use vm_map_trylock() to replace several direct uses of lockmgr(). o Add missing synchronization to vmspace_swap_count(): Obtain a read lock on the vm_map before traversing it.
# 670d17b5	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Remove references to vm_zone.h and switch over to the new uma API.
# 11caded3	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 8355f576	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator. Reviewed by: arch@
# 25adb370	18-Mar-2002	Brian Feldman <green@FreeBSD.org>	Back out the modification of vm_map locks from lockmgr to sx locks. The best path forward now is likely to change the lockmgr locks to simple sleep mutexes, then see if any extra contention it generates is greater than removed overhead of managing local locking state information, cost of extra calls into lockmgr, etc. Additionally, making the vm_map lock a mutex and respecting it properly will put us much closer to not needing Giant magic in vm.
# 0e0af8ec	13-Mar-2002	Brian Feldman <green@FreeBSD.org>	Rename SI_SUB_MUTEX to SI_SUB_MTX_POOL to make the name at all accurate. While doing this, move it earlier in the sysinit boot process so that the VM system can use it. After that, the system is now able to use sx locks instead of lockmgr locks in the VM system. To accomplish this, some of the more questionable uses of the locks (such as testing whether they are owned or not, as well as allowing shared+exclusive recursion) are removed, and simpler logic throughout is used so locks should also be easier to understand. This has been tested on my laptop for months, and has not shown any problems on SMP systems, either, so appears quite safe. One more user of lockmgr down, many more to go :)
# a1287949	10-Mar-2002	Eivind Eklund <eivind@FreeBSD.org>	- Remove a number of extra newlines that do not belong here according to style(9) - Minor space adjustment in cases where we have "( ", " )", if(), return(), while(), for(), etc. - Add /* SYMBOL */ after a few #endifs. Reviewed by: alc
# 7f3a4093	27-Feb-2002	Mike Silbersack <silby@FreeBSD.org>	Fix a horribly suboptimal algorithm in the vm_daemon. In order to determine what to page out, the vm_daemon checks reference bits on all pages belonging to all processes. Unfortunately, the algorithm used reacted badly with shared pages; each shared page would be checked once per process sharing it; this caused an O(N^2) growth of tlb invalidations. The algorithm has been changed so that each page will be checked only 16 times. Prior to this change, a fork/sleepbomb of 1300 processes could cause the vm_daemon to take over 60 seconds to complete, effectively freezing the system for that time period. With this change in place, the vm_daemon completes in less than a second. Any system with hundreds of processes sharing pages should benefit from this change. Note that the vm_daemon is only run when the system is under extreme memory pressure. It is likely that many people with loaded systems saw no symptoms of this problem until they reached the point where swapping began. Special thanks go to dillon, peter, and Chuck Cranor, who helped me get up to speed with vm internals. PR: 33542, 20393 Reviewed by: dillon MFC after: 1 week
# ef6020d1	19-Feb-2002	Mike Silbersack <silby@FreeBSD.org>	Changes to make the OOM killer much more effective: - Allow the OOM killer to target processes currently locked in memory. These very often are the ones doing the memory hogging. - Drop the wakeup priority of processes currently sleeping while waiting for their page fault to complete. In order for the OOM killer to work well, the killed process and other system processes waiting on memory must be allowed to wakeup first. Reviewed by: dillon MFC after: 1 week
# 027df6bd	31-Jan-2002	Matthew Dillon <dillon@FreeBSD.org>	GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer cache lockups for over a year now. MFC after: 0 days
# 23b59018	20-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release
# 57601bcb	21-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Syntax cleanup and documentation, no operational changes. MFC after: 1 day
# 30105b9e	14-Oct-2001	Tor Egge <tegge@FreeBSD.org>	Don't remove all mappings of a swapped out process if the vm map contained wired entries. vm_fault_unwire() depends on the mapping being intact. Reviewed by: dillon
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 6d03d577	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc). Also removed some spl's and added some VM mutexes, but they are not actually used yet, so this commit does not really make any operational changes to the system. vm_page.c relates to vm_page_t manipulation, including high level deactivation, activation, etc... vm_pageq.c relates to finding free pages and aquiring exclusive access to a page queue (exclusivity part not yet implemented). And the world still builds... :-)
# 54d92145	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	whitespace / register cleanup
# 0cddd8f0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# ad6c5bbe	20-Jun-2001	John Baldwin <jhb@FreeBSD.org>	Don't lock around swap_pager_swap_init() that is only called once during the pagedaemon's startup code since it calls malloc which results in lock order reversals.
# 69a78d46	19-Jun-2001	John Baldwin <jhb@FreeBSD.org>	Put the scheduler, vmdaemon, and pagedaemon kthreads back under Giant for now. The proc locking isn't actually safe yet and won't be until the proc locking is finished.
# ff2b5645	09-Jun-2001	Matthew Dillon <dillon@FreeBSD.org>	Two fixes to the out-of-swap process termination code. First, start killing processes a little earlier to avoid a deadlock. Second, when calculating the 'largest process' do not just count RSS. Instead count the RSS + SWAP used by the process. Without this the code tended to kill small inconsequential processes like, oh, sshd, rather then one of the many 'eatmem 200MB' I run on a whim :-). This fix has been extensively tested on -stable and somewhat tested on -current and will be MFCd in a few days. Shamed into fixing this by: ps
# 3614c6fc	23-May-2001	John Baldwin <jhb@FreeBSD.org>	- Add in several asserts of vm_mtx. - Assert Giant in vm_pageout_scan() for the vnode hacking that it does. - Don't hold vm_mtx around vget() or vput(). - Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also, lock curproc while setting the P_BUFEXHAUST flag. - For now we still hold Giant for all of the vm_daemon. When process limits are locked we will be only need Giant for swapout_procs().
# 23955314	18-May-2001	Alfred Perlstein <alfred@FreeBSD.org>	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 1c58e4e5	17-May-2001	John Baldwin <jhb@FreeBSD.org>	During the code to pick a process to kill when memory is exhausted, keep the process in question locked as soon as we find it and determine it to be eligible until we actually kill it. To avoid deadlock, we don't block on the process lock but skip any process that is already locked during our search.
# fb919e4d	01-May-2001	Mark Murray <markm@FreeBSD.org>	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
# 1005a129	28-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Convert the allproc and proctree locks from lockmgr locks to sx locks.
# 9ed346ba	08-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# fc2ffbe6	04-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# 8606d880	24-Jan-2001	John Baldwin <jhb@FreeBSD.org>	- Catch up to proc flag changes. - Minimal proc locking. - Use queue macros.
# 2b6b0df7	26-Dec-2000	Matthew Dillon <dillon@FreeBSD.org>	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then suceessive will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, amoung other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
# 21cd6e62	13-Dec-2000	Seigo Tanimura <tanimura@FreeBSD.org>	- If swap metadata does not fit into the KVM, reduce the number of struct swblock entries by dividing the number of the entries by 2 until the swap metadata fits. - Reject swapon(2) upon failure of swap_zone allocation. This is just a temporary fix. Better solutions include: (suggested by: dillon) o reserving swap in SWAP_META_PAGES chunks, and o swapping the swblock structures themselves. Reviewed by: alfred, dillon
# c0c25570	12-Dec-2000	Jake Burkholder <jake@FreeBSD.org>	- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead of explicit calls to lockmgr. Also provides macros for the flags pased to specify shared, exclusive or release which map to the lockmgr flags. This is so that the use of lockmgr can be easily replaced with optimized reader-writer locks. - Add some locking that I missed the first time.
# 02fa91d3	11-Dec-2000	Matthew Dillon <dillon@FreeBSD.org>	Be less conservative with a recently added KASSERT. Certain edge cases with file fragments and read-write mmap's can lead to a situation where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by an mmap and only the fragment (representing a non-page-aligned end of file) synced via a filesystem buffer. A correct solution that guarentees consistent m->dirty for the file EOF case is being worked on. In the mean time we can't be so conservative in the KASSERT.
# c5a44a6a	01-Dec-2000	John Baldwin <jhb@FreeBSD.org>	Protect p_stat with sched_lock.
# 553629eb	22-Nov-2000	Jake Burkholder <jake@FreeBSD.org>	Protect the following with a lockmgr lock: allproc zombproc pidhashtbl proc.p_list proc.p_hash nextpid Reviewed by: jhb Obtained from: BSD/OS and netbsd
# 936524aa	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# 0384fff8	06-Sep-2000	Jason Evans <jasone@FreeBSD.org>	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# f2a2857b	11-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 8b03c8ed	29-May-2000	Matthew Dillon <dillon@FreeBSD.org>	This is a cleanup patch to Peter's new OBJT_PHYS VM object type and sysv shared memory support for it. It implements a new PG_UNMANAGED flag that has slightly different characteristics from PG_FICTICIOUS. A new sysctl, kern.ipc.shm_use_phys has been added to enable the use of physically-backed sysv shared memory rather then swap-backed. Physically backed shm segments are not tracked with PV entries, allowing programs which use a large shm segment as a rendezvous point to operate without eating an insane amount of KVM in the PV entry management. Read: Oracle. Peter's OBJT_PHYS object will also allow us to eventually implement page-table sharing and/or 4MB physical page support for such segments. We're half way there.
# 883f3caa	28-May-2000	Matthew Dillon <dillon@FreeBSD.org>	Fix bug in vm_pageout_page_stats() that always resulted in a full scan of the active queue. This fix is not expected to have any noticeable impact on performance. Noticed by: Rik van Riel <riel@conectiva.com.br>
# 24964514	21-May-2000	Peter Wemm <peter@FreeBSD.org>	Checkpoint of a new physical memory backed object type, that does not have pv_entries. This is intended for very special circumstances, eg: a certain database that has a 1GB shm segment mapped into 300 processes. That would consume 2GB of kvm just to hold the pv_entries alone. This would not be used on systems unless the physical ram was available, as it's not pageable. This is a work-in-progress, but is a useful and functional checkpoint. Matt has got some more fixes for it that will be committed soon. Reviewed by: dillon
# 0385347c	20-May-2000	Peter Wemm <peter@FreeBSD.org>	Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly to various pmap_*() functions instead of looking up the physical address and passing that. In many cases, the first thing the pmap code was doing was going to a lot of trouble to get back the original vm_page_t, or it's shadow pv_table entry. Inspired by: John Dyson's 1998 patches. Also: Eliminate pv_table as a seperate thing and build it into a machine dependent part of vm_page_t. This eliminates having a seperate set of structions that shadow each other in a 1:1 fashion that we often went to a lot of trouble to translate from one to the other. (see above) This happens to save 4 bytes of physical memory for each page in the system. (8 bytes on the Alpha). Eliminate the use of the phys_avail[] array to determine if a page is managed (ie: it has pv_entries etc). Store this information in a flag. Things like device_pager set it because they create vm_page_t's on the fly that do not have pv_entries. This makes it easier to "unmanage" a page of physical memory (this will be taken advantage of in subsequent commits). Add a function to add a new page to the freelist. This could be used for reclaiming the previously wasted pages left over from preloaded loader(8) files. Reviewed by: dillon
# 25db2c54	27-Mar-2000	Matthew Dillon <dillon@FreeBSD.org>	Add necessary spl protection for swapper. The problem was located by Alfred while testing his SPLASSERT stuff. This is not a complete fix, more protections are probably needed.
# 5929bcfa	27-Mar-2000	Philippe Charnier <charnier@FreeBSD.org>	Revert spelling mistake I made in the previous commit Requested by: Alan and Bruce
# 956f3135	26-Mar-2000	Philippe Charnier <charnier@FreeBSD.org>	Spelling
# 6bdfe06a	11-Dec-1999	Eivind Eklund <eivind@FreeBSD.org>	Lock reporting and assertion changes. * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked. This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS. Discussed with: grog, mch, peter, phk Reviewed by: peter
# be72f788	30-Oct-1999	Alan Cox <alc@FreeBSD.org>	The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to eliminate an extra (useless) level of indirection in half of the page queue accesses and (2) to use a single name for each queue throughout, instead of, e.g., "vm_page_queue_active" in some places and "vm_page_queues[PQ_ACTIVE]" in others. Reviewed by: dillon
# 923502ff	29-Oct-1999	Poul-Henning Kamp <phk@FreeBSD.org>	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ\|WRITE} rather than B_{READ\|WRITE} as argument.
# 90ecac61	16-Sep-1999	Matthew Dillon <dillon@FreeBSD.org>	Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com> Replace various VM related page count calculations strewn over the VM code with inlines to aid in readability and to reduce fragility in the code where modules depend on the same test being performed to properly sleep and wakeup. Split out a portion of the page deactivation code into an inline in vm_page.c to support vm_page_dontneed(). add vm_page_dontneed(), which handles the madvise MADV_DONTNEED feature in a related commit coming up for vm_map.c/vm_object.c. This code prevents degenerate cases where an essentially active page may be rotated through a subset of the paging lists, resulting in premature disposal.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# aeea9b36	21-Aug-1999	Alan Cox <alc@FreeBSD.org>	Remove two unused variable declarations.
# bfbacbd9	16-Aug-1999	Alan Cox <alc@FreeBSD.org>	vm_pageout_clean: Remove dead code. Submitted by: dillon
# e929c00d	03-Jul-1999	Kirk McKusick <mckusick@FreeBSD.org>	The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truely empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future. reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occured when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 9c8b8baa	01-Jul-1999	Peter Wemm <peter@FreeBSD.org>	Slight reorganization of kernel thread/process creation. Instead of using SYSINIT_KT() etc (which is a static, compile-time procedure), use a NetBSD-style kthread_create() interface. kproc_start is still available as a SYSINIT() hook. This allowed simplification of chunks of the sysinit code in the process. This kthread_create() is our old kproc_start internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work the same as the NetBSD one. One thing I'd like to do shortly is get rid of nfsiod as a user initiated process. It makes sense for the nfs client code to create them on the fly as needed up to a user settable limit. This means that nfsiod doesn't need to be in /sbin and is always "available". This is a fair bit easier to do outside of the SYSINIT_KT() framework.
# d50c1994	26-Jun-1999	Peter Wemm <peter@FreeBSD.org>	There isn't much point waking up a daemon that hasn't existed since softupdates came in. Try calling speedup_syncer() instead..
# 11a9f83f	23-Apr-1999	Dmitrij Tejblum <dt@FreeBSD.org>	Make pmap_collect() an official pmap interface.
# c8da68e9	05-Apr-1999	Peter Wemm <peter@FreeBSD.org>	Don't forcibly kill processes that are locked in-core via PHOLD - it was just checking P_NOSWAP before.
# 51df5949	11-Mar-1999	Julian Elischer <julian@FreeBSD.org>	Stop the mfs from trying to swap out crucial bits of the mfs as this can lead to deadlock. Submitted by: Mat dillon <dillon@freebsd.org>
# fe2144fd	19-Feb-1999	Luoqi Chen <luoqi@FreeBSD.org>	Eliminate a possible numerical overflow.
# b1028ad1	19-Feb-1999	Luoqi Chen <luoqi@FreeBSD.org>	Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This is the preparation step for moving pmap storage out of vmspace proper. Reviewed by: Alan Cox <alc@cs.rice.edu> Matthew Dillion <dillon@apollo.backplane.com>
# faa273d5	07-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with PQ_FREE. There is little operational difference other then the kernel being a few kilobytes smaller and the code being more readable. * vm_page_select_free() has been greatly simplified. * The PQ_ZERO page queue and supporting structures have been removed * vm_page_zero_idle() revamped (see below) PG_ZERO setting and clearing has been migrated from vm_page_alloc() to vm_page_free[_zero]() and will eventually be guarenteed to remain tracked throughout a page's life ( if it isn't already ). When a page is freed, PG_ZERO pages are appended to the appropriate tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended. When locating a new free page, PG_ZERO selection operates from within vm_page_list_find() ( get page from end of queue instead of beginning of queue ) and then only occurs in the nominal critical path case. If the nominal case misses, both normal and zero-page allocation devolves into the same _vm_page_list_find() select code without any specific zero-page optimizations. Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been added and zero-page tracking adjusted to conform with the other changes. Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free pages. We may wish to increase both parameters as time permits. The hysteresis is designed to avoid silly zeroing in borderline allocation/free situations.
# 9fdfe602	07-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to attempt to optimize forks but were essentially given-up on due to problems and replaced with an explicit dup of the vm_map_entry structure. Prior to the removal, they were entirely unused.
# 7dbf82dc	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Change all manual settings of vm_page_t->dirty = VM_PAGE_BITS_ALL to use the vm_page_dirty() inline. The inline can thus do sanity checks ( or not ) over all cases.
# d044d7bf	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Added warning printf ( needs INVARIANTS ) when busy cache page is found while trying to free memory.
# aaba53da	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	It is possible for a page in the cache to be busy. vm_pageout.c was not checking for this condition while it tried to free cache pages. Fixed.
# a489f761	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Reorganized some of the low memory testing code to make it more useful. Removed call to vm_object_collapse(), which can block. This was being called without the pageout code holding any sort of reference on the vm_object or vm_page_t structures being manipulated. Since this code can block, it was possible for other kernel code to shred the state the pageout code was assuming remained intact. Fixed potential blocking condition in vm_pageout_page_free() ( which could cause a deadlock in a low-memory situation ). Currently there is a hack in-place to deal with clean filesystem meta-data polluting the inactive page queue. John doesn't like the hack, and neither do I. Revamped and commented a portion of the pageout loop. Added protection against potential memory deadlocks with OBJT_VNODE when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL which causes VOP_ISLOCKED() to return a less informed answer. remove vm_pager_sync() -- none of the pagers use it any more ( the old swapper used to. The new one does not ).
# 1c7c3c6a	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# b0359e2c	31-Oct-1998	Peter Wemm <peter@FreeBSD.org>	Add John Dyson's SYSCTL descriptions, and an export of more stats to a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present in source, they do not get compiled into the binaries taking up memory.
# f5ef029e	25-Oct-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# faa5f8d8	29-Sep-1998	Andrzej Bialecki <abial@FreeBSD.org>	Make #define NO_SWAPPING a normal kernel config option. Reviewed by: jkh
# e69763a3	04-Sep-1998	Doug Rabson <dfr@FreeBSD.org>	Cosmetic changes to the PAGE_XXX macros to make them consistent with the other objects in vm.
# 069e9bc1	24-Aug-1998	Doug Rabson <dfr@FreeBSD.org>	Change various syscalls to use size_t arguments instead of u_int. Add some overflow checks to read/write (from bde). Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags and vm_object::paging_in_progress to use operations which are not interruptable. Reviewed by: Bruce Evans <bde@zeta.org.au>
# d474eaaa	06-Aug-1998	Doug Rabson <dfr@FreeBSD.org>	Protect all modifications to paging_in_progress with splvm(). The i386 managed to avoid corruption of this variable by luck (the compiler used a memory read-modify-write instruction which wasn't interruptable) but other architectures cannot. With this change, I am now able to 'make buildworld' on the alpha (sfx: the crowd goes wild...)
# 427e99a0	10-Jul-1998	Alexander Langer <alex@FreeBSD.org>	Removed unnecessary test from if/else construct. PR: 7233 Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
# e8f36785	01-Jun-1998	John Dyson <dyson@FreeBSD.org>	Correct sleep priority.
# 227ee8a1	30-Mar-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't a atomic variable, so splfoo() protection were needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now uses the new variable time_second instead. gettime() changed to getmicrotime(0. Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeard time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passwd &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# bef608bd	15-Mar-1998	John Dyson <dyson@FreeBSD.org>	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
# be01eafd	08-Mar-1998	John Dyson <dyson@FreeBSD.org>	Quell unneeded pageout daemon activity.
# 8f9110f6	07-Mar-1998	John Dyson <dyson@FreeBSD.org>	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
# ffc82b0a	28-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappyness."
# a15403de	24-Feb-1998	John Dyson <dyson@FreeBSD.org>	Correct some severe VM tuning problems for small systems (<=16MB), and improve tuning on larger systems. (A couple of the VM tuning params for small systems were so badly chosen that the system could hang under load.) The broken tuning was originaly my fault.
# e47ed70b	23-Feb-1998	John Dyson <dyson@FreeBSD.org>	Significantly improve the efficiency of the swap pager, which appears to have declined due to code-rot over time. The swap pager rundown code has been clean-up, and unneeded wakeups removed. Lots of splbio's are changed to splvm's. Also, set the dynamic tunables for the pageout daemon to be more sane for larger systems (thereby decreasing the daemon overheadla.)
# 303b270b	08-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 95461b45	04-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Start using a cleaner and more consistant page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec pagin. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# eaf13dd7	31-Jan-1998	John Dyson <dyson@FreeBSD.org>	Change the busy page mgmt, so that when pages are freed, they MUST be PG_BUSY. It is bogus to free a page that isn't busy, because it is in a state of being "unavailable" when being freed. The additional advantage is that the page_remove code has a better cross-check that the page should be busy and unavailable for other use. There were some minor problems with the collapse code, and this plugs those subtile "holes." Also, the vfs_bio code wasn't checking correctly for PG_BUSY pages. I am going to develop a more consistant scheme for grabbing pages, busy or otherwise. For now, we are stuck with the current morass.
# 2d8acc0f	22-Jan-1998	John Dyson <dyson@FreeBSD.org>	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 47221757	17-Jan-1998	John Dyson <dyson@FreeBSD.org>	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtile bugs aren't fixed yet.
# 925a3a41	11-Jan-1998	John Dyson <dyson@FreeBSD.org>	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 95e5e988	05-Jan-1998	John Dyson <dyson@FreeBSD.org>	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitiously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. When vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.
# 2be70f79	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# b44e4b7a	24-Dec-1997	John Dyson <dyson@FreeBSD.org>	Support running with inadequate swap space. Additionally, the code will complain with a suggestion of increasing it.
# ceb0cf87	05-Dec-1997	John Dyson <dyson@FreeBSD.org>	Support an optional, sysctl enabled feature of idle process swapout. This is apparently useful for large shell systems, or systems with long running idle processes. To enable the feature: sysctl -w vm.swap_idle_enabled=1 Please note that some of the other vm sysctl variables have been renamed to be more accurate. Submitted by: Much of it from Matt Dillon <dillon@best.net>
# 70111b90	04-Dec-1997	John Dyson <dyson@FreeBSD.org>	Add new (very useful) tunable for pageout daemon. The flag changes the maximum pageout rate: sysctl -w vm.vm_maxlaunder=n 1 < n < inf. If paging heavily on large systems, it is likely that a performance improvement can be achieved by increasing the parameter. On a large system, the parm is 32, but numbers as large as 128 can make a big difference. If paging is expensive, you might try decreasing the number to 1-8.
# 12ac6a1d	04-Dec-1997	John Dyson <dyson@FreeBSD.org>	Support applications that need to resist or deny use of swap space. sysctl -w vm.defer_swap_pageouts=1 Causes the system to resist the use of swap space. In low memory conditions, performance will decrease. sysctl -w vm.disable_swap_pageouts=1 Causes the system to mostly disable the use of swap space. In low memory conditions, the system will likely start killing processes.
# 5985940e	24-Oct-1997	John Dyson <dyson@FreeBSD.org>	Support garbage collecting the pmap pv entries. The management doesn't happen until the system would have nearly failed anyway, so no signficant overhead is added. This helps large systems with lots of processes.
# 7e006499	05-Oct-1997	John Dyson <dyson@FreeBSD.org>	Improve management of pages moving from the inactive to active queue. Additionally, add some much needed comments.
# 79624e21	31-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# dc2efb27	26-Jul-1997	John Dyson <dyson@FreeBSD.org>	Add the ability for the pageout daemon to measure stats on memory usage before the system is out of memory. The daemon does a minimal amount of work that increases as the system becomes more likely to run out of memory and page in/out. The default tuning is fairly low in background CPU usage, and sysctl variables have been added to enable flexable operation. This is an experimental feature that will likely be changed and improved over time.
# 2f558c3e	27-Feb-1997	Bruce Evans <bde@FreeBSD.org>	Removed a wrong LK_INTERLOCK flag.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# afa07f7e	15-Jan-1997	John Dyson <dyson@FreeBSD.org>	Change the map entry flags from bitfields to bitmasks. Allows for some code simplification.
# 16a02c11	15-Jan-1997	Bruce Evans <bde@FreeBSD.org>	Removed redundant spl0()'s from kernel processes. They were work-arounds for a bug in fork().
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# d4a272db	11-Jan-1997	John Dyson <dyson@FreeBSD.org>	Slightly correct the code that moves pages from the active to the inactive queue. This is only a minor performance improvement, but will not affect perf on machines that don't have ref bits.
# 106031ef	03-Jan-1997	John Dyson <dyson@FreeBSD.org>	Undo the collapse breakage (swap space usage problem.)
# 3c018e72	31-Dec-1996	John Dyson <dyson@FreeBSD.org>	Guess what? We left alot of the old collapse code that is not needed anymore with the "full" collapse fix that we added about 1yr ago!!! The code has been removed by optioning it out for now, so we can put it back in ASAP if any problems are found.
# e0c5a895	28-Nov-1996	John Dyson <dyson@FreeBSD.org>	Make the kernel smaller with at worst a neutral effect on perf by de-inlining some VM calls. (Actually, I measured a small improvement.)
# a2f4a846	27-Sep-1996	John Dyson <dyson@FreeBSD.org>	Reviewed by: Submitted by: Obtained from:
# 5070c7f8	08-Sep-1996	John Dyson <dyson@FreeBSD.org>	Addition of page coloring support. Various levels of coloring are afforded. The default level works with minimal overhead, but one can also enable full, efficient use of a 512K cache. (Parameters can be generated to support arbitrary cache sizes also.)
# 67bf6868	29-Jul-1996	John Dyson <dyson@FreeBSD.org>	Backed out the recent changes/enhancements to the VM code. The problem with the 'shell scripts' was found, but there was a 'strange' problem found with a 486 laptop that we could not find. This commit backs the code back to 25-jul, and will be re-entered after the snapshot in smaller (more easily tested) chunks.
# 4f4d35ed	26-Jul-1996	John Dyson <dyson@FreeBSD.org>	This commit is meant to solve a couple of VM system problems or performance issues. 1) The pmap module has had too many inlines, and so the object file is simply bigger than it needs to be. Some common code is also merged into subroutines. 2) Removal of some evil PHYS_TO_VM_PAGE macro calls. Unfortunately, a few have needed to be added also. The removal caused the need for more vm_page_lookups. I added lookup hints to minimize the need for the page table lookup operations. 3) Removal of some bogus performance improvements, that mostly made the code more complex (tracking individual page table page updates unnecessarily). Those improvements actually hurt 386 processors perf (not that people who worry about perf use 386 processors anymore :-)). 4) Changed pv queue manipulations/structures to be TAILQ's. 5) The pv queue code has had some performance problems since day one. Some significant scalability issues are resolved by threading the pv entries from the pmap AND the physical address instead of just the physical address. This makes certain pmap operations run much faster. This does not affect most micro-benchmarks, but should help loaded system performance significantly. DG helped and came up with most of the solution for this one. 6) Most if not all pmap bit operations follow the pattern: pmap_test_bit(); pmap_clear_bit(); That made for twice the necessary pv list traversal. The pmap interface now supports only pmap_tc_bit type operations: pmap_[test/clear]_modified, pmap_[test/clear]_referenced. Additionally, the modified routine now takes a vm_page_t arg instead of a phys address. This eliminates a PHYS_TO_VM_PAGE operation. 7) Several rewrites of routines that contain redundant code to use common routines, so that there is a greater likelihood of keeping the cache footprint smaller.
# 502ba6e4	07-Jul-1996	John Dyson <dyson@FreeBSD.org>	Back-off on the previous commit, specifically remove the look-ahead optimization on the active queue scan. I will do this correctly later.
# c8c4b40c	07-Jul-1996	John Dyson <dyson@FreeBSD.org>	Fix a problem with the pageout daemon RSS limiting, where it degrades performance to LRU or worse when RSS limiting takes effect. Also, make an end condition in the active queue scan more efficient in the case where pages are removed from the active queue as a side effect of a pmap operation.
# 01155bd7	29-Jun-1996	David Greenman <dg@FreeBSD.org>	Make sure we have an object in the map entry before trying to trim pages from it.
# 38efa82b	25-Jun-1996	John Dyson <dyson@FreeBSD.org>	This commit does a couple of things: Re-enables the RSS limiting, and the routine is now tail-recursive, making it much more safe (eliminates the possiblity of kernel stack overflow.) Also, the RSS limiting is a little more intelligent about finding the likely objects that are pushing the process over the limit. Added some sysctls that help with VM system tuning. New sysctl features: 1) Enable/disable lru pageout algorithm. vm.pageout_algorithm = 0, default algorithm that works well, especially using X windows and heavy memory loading. Can have adverse effects, sometimes slowing down program loading. vm.pageout_algorithm = 1, close to true LRU. Works much better than clock, etc. Does not work as well as the default algorithm in general. Certain memory "malloc" type benchmarks work a little better with this setting. Please give me feedback on the performance results associated with these. 2) Enable/disable swapping. vm.swapping_enabled = 1, default. vm.swapping_enabled = 0, useful for cases where swapping degrades performance. The config option "NO_SWAPPING" is still operative, and takes precedence over the sysctl. If "NO_SWAPPING" is specified, the sysctl still exists, but "vm.swapping_enabled" is hard-wired to "0". Each of these can be changed "on the fly."
# a001376d	23-Jun-1996	John Dyson <dyson@FreeBSD.org>	Remove RSS limiting until I rewrite the code to be non-recursive. The code can overrun the kernel stack under very stressful conditions.
# ef743ce6	16-Jun-1996	John Dyson <dyson@FreeBSD.org>	Several bugfixes/improvements: 1) Make it much less likely to miss a wakeup in vm_page_free_wakeup 2) Create a new entry point into pmap: pmap_ts_referenced, eliminates the need to scan the pv lists twice in many cases. Perhaps there is alot more to do here to work on minimizing pv list manipulation 3) Minor improvements to vm_pageout including the use of pmap_ts_ref. 4) Major changes and code improvement to pmap. This code has had several serious bugs in page table page manipulation. In order to simplify the problem, and hopefully solve it for once and all, page table pages are no longer "managed" with the pv list stuff. Page table pages are only (mapped and held/wired) or (free and unused) now. Page table pages are never inactive, active or cached. These changes have probably fixed the hold count problems, but if they haven't, then the code is simpler anyway for future bugfixing. 5) The pmap code has been sorely in need of re-organization, and I have taken a first (of probably many) steps. Please tell me if you have any ideas.
# f35329ac	30-May-1996	John Dyson <dyson@FreeBSD.org>	This commit is dual-purpose, to fix more of the pageout daemon queue corruption problems, and to apply Gary Palmer's code cleanups. David Greenman helped with these problems also. There is still a hang problem using X in small memory machines.
# 545901f7	29-May-1996	John Dyson <dyson@FreeBSD.org>	Correct some unfortunately chosen constants, otherwise, not enough pages are calculated for deferred allocation of swap pager data structures. This is a follow-on to the previous commit to this file.
# b182ec9e	28-May-1996	John Dyson <dyson@FreeBSD.org>	After careful review by David Greenman and myself, David had found a case where blocking can occur, thereby giving other process's a chance to modify the queue where a page resides. This could cause numerous process and system failures.
# 85a376eb	26-May-1996	John Dyson <dyson@FreeBSD.org>	Fix a couple of problems in the pageout_scan routine. First, there is a condition when blocking can occur, and the daemon did not check properly for a page remaining on the expected queue. Additionally, the inactive target was being set much too large for small memory machines. It is now being calculated based upon the amount of user memory available on every pageout daemon run. Another problem was that if memory was very low, the pageout daemon could fail repeatedly to traverse the inactive queue.
# 1eeaa1e3	23-May-1996	John Dyson <dyson@FreeBSD.org>	Add apparently needed splvm protection to the active queue, and eliminate an unnecessary test for dirty pages if it is already known to be dirty.
# b18bfc3d	17-May-1996	John Dyson <dyson@FreeBSD.org>	This set of commits to the VM system does the following, and contain contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>, Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me: More usage of the TAILQ macros. Additional minor fix to queue.h. Performance enhancements to the pageout daemon. Addition of a wait in the case that the pageout daemon has to run immediately. Slightly modify the pageout algorithm. Significant revamp of the pmap/fork code: 1) PTE's and UPAGES's are NO LONGER in the process's map. 2) PTE's and UPAGES's reside in their own objects. 3) TOTAL elimination of recursive page table pagefaults. 4) The page directory now resides in the PTE object. 5) Implemented pmap_copy, thereby speeding up fork time. 6) Changed the pv entries so that the head is a pointer and not an entire entry. 7) Significant cleanup of pmap_protect, and pmap_remove. 8) Removed significant amounts of machine dependent fork code from vm_glue. Pushed much of that code into the machine dependent pmap module. 9) Support more completely the reuse of already zeroed pages (Page table pages and page directories) as being already zeroed. Performance and code cleanups in vm_map: 1) Improved and simplified allocation of map entries. 2) Improved vm_map_copy code. 3) Corrected some minor problems in the simplify code. Implemented splvm (combo of splbio and splimp.) The VM code now seldom uses splhigh. Improved the speed of and simplified kmem_malloc. Minor mod to vm_fault to avoid using pre-zeroed pages in the case of objects with backing objects along with the already existant condition of having a vnode. (If there is a backing object, there will likely be a COW... With a COW, it isn't necessary to start with a pre-zeroed page.) Minor reorg of source to perhaps improve locality of ref.
# bd105bb7	11-Apr-1996	Bruce Evans <bde@FreeBSD.org>	Fixed a spl hog. The vmdaemon process ran entirely at splhigh. It sometimes disabled clock interrupts for 60 msec or more on a P133. Clock interrupts were lost ... Reviewed by: dyson
# 30dcfc09	27-Mar-1996	John Dyson <dyson@FreeBSD.org>	VM performance improvements, and reorder some operations in VM fault in anticipation of a fix in pmap that will allow the mlock system call to work without panicing the system.
# 8169788f	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change.
# 1b67ec6d	10-Mar-1996	Jeffrey Hsu <hsu@FreeBSD.org>	For Lite2: proc LIST changes. Reviewed by: davidg & bde
# 6ac5bfdb	08-Mar-1996	John Dyson <dyson@FreeBSD.org>	Fix a calculation for a paging parameter.
# 5afce282	22-Feb-1996	David Greenman <dg@FreeBSD.org>	Add a "NO_SWAPPING" option to disable swapping. This was originally done to help diagnose a problem on wcarchive (where the kernel stack was sometimes not present), but is useful in its own right since swapping actually reduces performance on some systems (such as wcarchive). Note: swapping in this context means making the U pages pageable and has nothing to do with generic VM paging, which is unaffected by this option. Reviewed by: <dyson>
# 729b1e51	30-Jan-1996	David Greenman <dg@FreeBSD.org>	Improved killproc() log message and made it and the other similar message tolerant of p_ucred being invalid. Starting using killproc() where appropriate.
# bd7e5f99	18-Jan-1996	John Dyson <dyson@FreeBSD.org>	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# f708ef1b	14-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Another mega commit to staticize things.
# a316d390	10-Dec-1995	John Dyson <dyson@FreeBSD.org>	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# efeaf95a	06-Dec-1995	David Greenman <dg@FreeBSD.org>	Untangled the vm.h include file spaghetti.
# 3af76890	19-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused vars & funcs, make things static, protoize a little bit.
# aef922f5	05-Nov-1995	John Dyson <dyson@FreeBSD.org>	Greatly simplify the msync code. Eliminate complications in vm_pageout for msyncing. Remove a bug that manifests itself primarily on NFS (the dirty range on the buffers is not set on msync.)
# a91c5a7e	22-Oct-1995	John Dyson <dyson@FreeBSD.org>	Get rid of machine-dependent NBPG and replace with PAGE_SIZE.
# cd41fc12	07-Oct-1995	David Greenman <dg@FreeBSD.org>	Fix argument passing to the "freeer" routine. Added some prototypes. (bde) Moved extern declaration of swap_pager_full into swap_pager.h and out of the various files that reference it. (davidg) Submitted by: bde & davidg
# a5eb0e27	06-Oct-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Avoid a 64bit divide.
# 4590fd3a	09-Sep-1995	David Greenman <dg@FreeBSD.org>	Fixed init functions argument type - caddr_t -> void *. Fixed a couple of compiler warnings.
# 2b14f991	28-Aug-1995	Julian Elischer <julian@FreeBSD.org>	Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert) This is a composite of 3 patch sets submitted by terry. they are: New low-level init code that supports loadbal modules better some cleanups in the namei code to help terry in 16-bit character support some changes to the mount-root code to make it a little more modular.. NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.. certainly mounting root of disk still works just fine.. mfs should work but is untested. (tomorrows task) The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
# 24a1cce3	13-Jul-1995	David Greenman <dg@FreeBSD.org>	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
# 6306c897	10-Jul-1995	David Greenman <dg@FreeBSD.org>	swapout_threads() -> swapout_procs().
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# 61f5d510	21-May-1995	David Greenman <dg@FreeBSD.org>	Changes to fix the following bugs: 1) Files weren't properly synced on filesystems other than UFS. In some cases, this lead to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping. Reviewed by: David Greenman Submitted by: John Dyson
# ee3a64c9	10-May-1995	David Greenman <dg@FreeBSD.org>	Changed "handle" from type caddr_t to void ; "handle" is several different types of pointers, and "char " is a bad choice for the type.
# 4c1f8ee9	17-Apr-1995	David Greenman <dg@FreeBSD.org>	Fixed a logic bug that caused the vmdaemon to not wake up when intended. Submitted by: John Dyson
# 7c0414d0	16-Apr-1995	David Greenman <dg@FreeBSD.org>	Removed obsolete/unused variable declarations. Killed externs and included appropriate include files.
# c3cb3e12	15-Apr-1995	David Greenman <dg@FreeBSD.org>	Moved some zero-initialized variables into .bss. Made code intended to be called only from DDB #ifdef DDB. Removed some completely unused globals.
# f6b04d2b	09-Apr-1995	David Greenman <dg@FreeBSD.org>	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# 17c4c408	27-Mar-1995	David Greenman <dg@FreeBSD.org>	Fixed typo...using wrong variable in page_shortage calculation.
# 0bb3a0d2	27-Mar-1995	David Greenman <dg@FreeBSD.org>	Fixed "pages freed by daemon" statistic (again).
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 61ca29b0	12-Mar-1995	David Greenman <dg@FreeBSD.org>	Deleted vm_object_setpager().
# f919ebde	01-Mar-1995	David Greenman <dg@FreeBSD.org>	Various changes from John and myself that do the following: New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that are used to reduce the actual number of wakeups. New function vm_page_protect which is used in conjuction with some new page flags to reduce the number of calls to pmap_page_protect. Minor changes to reduce unnecessary spl nesting. Rewrote vm_page_alloc() to improve readability. Various other mostly cosmetic changes.
# 4f9fb771	25-Feb-1995	Bruce Evans <bde@FreeBSD.org>	Don't use __P(()) in a function definition.
# 6f2b142e	22-Feb-1995	David Greenman <dg@FreeBSD.org>	vm_page.c: Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding if the caller is "important" in vm_page_alloc(). Also established a new low threshold for non-interrupt allocations via cnt.v_interrupt_free_min. vm_pageout.c: Various algorithmic cleanup. Some calculations simplified. Initialize cnt.v_interrupt_free_min to 2 pages. Submitted by: John Dyson
# c0503609	22-Feb-1995	David Greenman <dg@FreeBSD.org>	Only do object paging_in_progress wakeups if someone is waiting on this condition. Submitted by: John Dyson
# 0c1dacbc	20-Feb-1995	David Greenman <dg@FreeBSD.org>	Moved ACT_MAX, ACT_ADVANCE, and ACT_DECLINE to vm_page.h.
# d2fc5315	13-Feb-1995	Poul-Henning Kamp <phk@FreeBSD.org>	YF fix.
# 1ed81ef2	09-Feb-1995	David Greenman <dg@FreeBSD.org>	Minor algorithmic adjustments that reduce the CPU consumption of the pagedaemon in half while not reducing its effectiveness. Submitted by: me & John
# a1f6d91c	02-Feb-1995	David Greenman <dg@FreeBSD.org>	swap_pager.c: Fixed long standing bug in freeing swap space during object collapses. Fixed 'out of space' messages from printing out too often. Modified to use new kmem_malloc() calling convention. Implemented an additional stat in the swap pager struct to count the amount of space allocated to that pager. This may be removed at some point in the future. Minimized unnecessary wakeups. vm_fault.c: Don't try to collect fault stats on 'swapped' processes - there aren't any upages to store the stats in. Changed read-ahead policy (again!). vm_glue.c: Be sure to gain a reference to the process's map before swapping. Be sure to lose it when done. kern_malloc.c: Added the ability to specify if allocations are at interrupt time or are 'safe'; this affects what types of pages can be allocated. vm_map.c: Fixed a variety of map lock problems; there's still a lurking bug that will eventually bite. vm_object.c: Explicitly initialize the object fields rather than bzeroing the struct. Eliminated the 'rcollapse' code and folded it's functionality into the "real" collapse routine. Moved an object_unlock() so that the backing_object is protected in the qcollapse routine. Make sure nobody fools with the backing_object when we're destroying it. Added some diagnostic code which can be called from the debugger that looks through all the internal objects and makes certain that they all belong to someone. vm_page.c: Fixed a rather serious logic bug that would result in random system crashes. Changed pagedaemon wakeup policy (again!). vm_pageout.c: Removed unnecessary page rotations on the inactive queue. Changed the number of pages to explicitly free to just free_reserved level. Submitted by: John Dyson
# 8f895206	27-Jan-1995	David Greenman <dg@FreeBSD.org>	Completed the fix for attempting to page out pages via the device_pager. Submitted by: John Dyson
# 6d40c3d3	24-Jan-1995	David Greenman <dg@FreeBSD.org>	Added ability to detect sequential faults and DTRT. (swap_pager.c) Added hook for pmap_prefault() and use symbolic constant for new third argument to vm_page_alloc() (vm_fault.c, various) Changed the way that upages and page tables are held. (vm_glue.c) Fixed architectural flaw in allocating pages at interrupt time that was introduced with the merged cache changes. (vm_page.c, various) Adjusted some algorithms to acheive better paging performance and to accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c) Fixed pbuf handling problem, changed policy on handling read-behind page. (vnode_pager.c) Submitted by: John Dyson
# 480dff54	10-Jan-1995	David Greenman <dg@FreeBSD.org>	Fixed some formatting weirdness that I overlooked in the previous commit.
# 0d94caff	09-Jan-1995	David Greenman <dg@FreeBSD.org>	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 832f3afd	02-Jan-1995	Andreas Schulz <ats@FreeBSD.org>	Submitted by: Ben Jackson just a missing newline in a kernel printf added.
# eadf9e27	25-Nov-1994	David Greenman <dg@FreeBSD.org>	These changes fix a couple of lingering VM problems: 1. The pageout daemon used to block under certain circumstances, and we needed to add new functionality that would cause the pageout daemon to block more often. Now, the pageout daemon mostly just gets rid of pages and kills processes when the system is out of swap. The swapping, rss limiting and object cache trimming have been folded into a new daemon called "vmdaemon". This new daemon does things that need to be done for the VM system, but can block. For example, if the vmdaemon blocks for memory, the pageout daemon can take care of it. If the pageout daemon had blocked for memory, it was difficult to handle the situation correctly (and in some cases, was impossible). 2. The collapse problem has now been entirely fixed. It now appears to be impossible to accumulate unnecessary vm objects. The object collapsing now occurs when ref counts drop to one (where it is more likely to be more simple anyway because less pages would be out on disk.) The original fixes were incomplete in that pathological circumstances could still be contrived to cause uncontrolled growth of swap. Also, the old code still, under steady state conditions, used more swap space than necessary. When using the new code, users will generally notice a significant decrease in swap space usage, and theoretically, the system should be leaving fewer unused pages around competing for memory. Submitted by: John Dyson
# 79221631	16-Nov-1994	David Greenman <dg@FreeBSD.org>	Don't ever try to kill off process 1 - even if we are out of swap space and it's the candidate pig.
# b0150bfc	13-Nov-1994	David Greenman <dg@FreeBSD.org>	Set laundry flag when transitioning an inactive page from clean to dirty. This fixes a performance bug where pages would sometimes not be paged out when they could be. Submitted by: John Dyson
# 2fe6e4d7	05-Nov-1994	David Greenman <dg@FreeBSD.org>	Added support for starting the experimental "vmdaemon" system process. Enabled via REL2_1. Added support for doing object collapses "on the fly". Enabled via REL2_1a. Improved object collapses so that they can happen in more cases. Improved sensing of modified pages to fix an apparant race condition and improve clustered pageout opportunities. Fixed an "oops" with not restarting page scan after a potential block in vm_pageout_clean() (not doing this can result in strange behavior in some cases). Submitted by: John Dyson & David Greenman
# 191ee5b3	24-Oct-1994	David Greenman <dg@FreeBSD.org>	#if 0'd out the object cache trimming code - there are multiple ways that the pageout daemon can deadlock otherwise. Submitted by: John Dyson
# e8fbe458	23-Oct-1994	David Greenman <dg@FreeBSD.org>	Fixed object cache trimming policy so it actually works. Submitted by: John Dyson
# 389918ee	23-Oct-1994	David Greenman <dg@FreeBSD.org>	Adjusted reserved levels to fix a deadlock condition. Submitted by: John Dyson
# 5663e6de	21-Oct-1994	David Greenman <dg@FreeBSD.org>	Various changes to allow operation without any swapspace configured. Note that this is intended for use only in floppy situations and is done at the sacrifice of performance in that case (in ther words, this is not the best solution, but works okay for this exceptional situation). Submitted by: John Dyson
# a58d1fa1	18-Oct-1994	David Greenman <dg@FreeBSD.org>	Fix the remaining vmmeter counters. They all now work correctly.
# 976e77fc	15-Oct-1994	David Greenman <dg@FreeBSD.org>	1) Some of the counters in the vmmeter struct don't fit well into the Mach VM scheme of things, so I've changed them to be more appropriate. page in/ous are now associated with the pager that did them. Nuked v_fault as the only fault of interest that wouldn't be already counted in v_trap is a VM fault, and this is counted seperately. 2) Implemented most of the remaining counters and corrected the counting of some that were done wrong. They are all almost correct now...just a few minor ones left to fix.
# 28e12d63	13-Oct-1994	David Greenman <dg@FreeBSD.org>	Fixed an object reference count problem that was caused by a call to vm_object_lookup() being outside of some parens. The bug was introduced via some recently added code. Reviewed by: John Dyson
# 05f0fdd2	08-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc. Reviewed by: davidg
# 4e39a515	07-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics. Unused vars and other warnings.
# 5cedf680	05-Oct-1994	David Greenman <dg@FreeBSD.org>	Fixed minor bug caused by some missing parens that can result in slightly reduced paging performance by missing a clustering opportunity. Found by Poul-Henning Kamp with gcc -Wall.
# edaaafdb	03-Oct-1994	David Greenman <dg@FreeBSD.org>	Fixed bug related to proper sensing of page modification that we inadvertantly introduced in pre-1.1.5. This could cause page modifications to go unnoticed during certain extreme low memory/high paging rate conditions. Submitted by: John Dyson and David Greenman
# 11b224dc	12-Sep-1994	David Greenman <dg@FreeBSD.org>	Fixed a bug I introduced when fixing the rss limit code. Changed swapout policy to be a bit more selective about what processes get swapped out. Reviewed by: John Dyson
# ed74321b	12-Sep-1994	David Greenman <dg@FreeBSD.org>	Don't deactivate pages in 0-refcount objects. Added a couple of missing paging stats. Fixed problem with free_reserved becoming depleted during certain swap_pager operations. Submitted by: John Dyson, with a little help from me
# a647a309	06-Sep-1994	David Greenman <dg@FreeBSD.org>	Simple changes to paging algorithms...but boy do they make a difference. FreeBSD's paging performance has never been better. Wow. Submitted by: John Dyson
# a6ca859e	30-Aug-1994	David Greenman <dg@FreeBSD.org>	Fixed bug caused by change of rlimit variables to quad_t's. The bug was in using min() to calculate the minimum of rss_cur,rss_max - since these are now quad_t's and min() takes u_ints...the comparison later for exceeding the rss limit was always true - resulting in rather serious page thrashing. Now using new qmin() function for this purpose. Fixed another bug where PG_BUSY pages would sometimes be paged out (bad!). This was caused by the PG_BUSY flag not being included in a comparison.
# f23b4c91	18-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 16f62314	06-Aug-1994	David Greenman <dg@FreeBSD.org>	Incorporated post 1.1.5 work from John Dyson. This includes performance improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/ pmap_kremove. These routine allow fast mapping of pages for those architectures that have "normal" MMUs. Also included is a fix to the pageout daemon to properly check a queue end condition. Submitted by: John Dyson
# bbc0ec52	03-Aug-1994	David Greenman <dg@FreeBSD.org>	Integrated VM system improvements/fixes from FreeBSD-1.1.5.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 03e6c253	01-Aug-1994	David Greenman <dg@FreeBSD.org>	Removed all code related to the pagescan daemon, and changed 'act_count' adjustments to compensate for a world without the pagescan daemon.
# 0a620217	06-Jun-1994	David Greenman <dg@FreeBSD.org>	Don't move the page's position in the active queue if it is busy or held. John has noticed some stability problems when doing this.
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources