History log of /freebsd-current/sys/vm/vm_pageout.c
Revision Date Author Comments
# a216e311 24-May-2024 Ryan Libby <rlibby@FreeBSD.org>

vm_pageout_scan_inactive: take a lock break

In vm_pageout_scan_inactive, release the object lock when we go to
refill the scan batch queue so that someone else has a chance to acquire
it. This improves access latency to the object when the pagedaemon is
processing many consecutive pages from a single object, and also in any
case avoids a hiccup during refill for the last touched object.

Reviewed by: alc, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45288


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 1cac76c9 14-Dec-2022 Andrew Gallatin <gallatin@FreeBSD.org>

vm: reduce lock contention when processing vm batchqueues

Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for for our busy sendfile() based web workload.
It also greadly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.

So that the system does not loose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.

This has been run for several months on a busy Netflix server, as well
as on my personal desktop.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305


# 0cb2610e 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm: Remove handling for OBJT_DEFAULT objects

Now that OBJT_DEFAULT objects can't be instantiated, we can simplify
checks of the form object->type == OBJT_DEFAULT || (object->flags &
OBJ_SWAP) != 0. No functional change intended.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35788


# e123264e 19-Jun-2022 Mark Johnston <markj@FreeBSD.org>

vm: Fix racy checks for swap objects

Commit 4b8365d752ef introduced the ability to dynamically register
VM object types, for use by tmpfs, which creates swap-backed objects.
As a part of this, checks for such objects changed from

object->type == OBJT_DEFAULT || object->type == OBJT_SWAP

to

object->type == OBJT_DEFAULT || (object->flags & OBJ_SWAP) != 0

In particular, objects of type OBJT_DEFAULT do not have OBJ_SWAP set;
the swap pager sets this flag when converting from OBJT_DEFAULT to
OBJT_SWAP.

A few of these checks are done without the object lock held. It turns
out that this can result in false negatives since the swap pager
converts objects like so:

object->type = OBJT_SWAP;
object->flags |= OBJ_SWAP;

Fix the problem by adding explicit tests for OBJT_SWAP objects in
unlocked checks.

PR: 258932
Fixes: 4b8365d752ef ("Add OBJT_SWAP_TMPFS pager")
Reported by: bdrewery
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35470


# b51927b7 10-Feb-2022 Konstantin Belousov <kib@FreeBSD.org>

Revert "vm_pageout_scans: correct detection of active object"

This reverts commit 3de96d664aaaf8e3fb1ca4fc4bd864d2cf734b24.

Problem is that it is possible to reach the state with ref_count ==
1 for the mapped non-anonymous object. For instance, anonymous posix
shmfd or linux shmfs object could be mapped, and then corresponding
file descriptor closed, dropping the object reference owned by the
shmfd/shmfs file. Then the check in inactive scan assumes that the
object and page are not mapped and frees the page, while they are not.

PR: 261707
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: now


# 3de96d66 16-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

vm_pageout_scans: correct detection of active object

For non-anonymous swap objects, there is always a reference from the
owner to the object to keep it from recycling. Account for it when
deciding should we query pmap for hardware active references for the
page.

As result, we avoid unneeded calls to pmap_ts_referenced(), which for
non-mapped page means avoiding unneccessary lock and unlock of the pv list.

Reviewed by: markj
Discussed with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33924


# 4a864f62 14-Jan-2022 Mark Johnston <markj@FreeBSD.org>

vm_pageout: Print a more accurate message to the console before an OOM kill

Previously we'd always print "out of swap space." This can be
misleading, as there are other reasons an OOM kill can be triggered. In
particular, it's entirely possible to trigger an OOM kill on a system
with plenty of free swap space.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33810


# c4a25e07 11-Jan-2022 Mark Johnston <markj@FreeBSD.org>

vm_pageout: Group sysctl variables together with sysctl definitions

Fix some style bugs while here. No functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33811


# fa7a635f 14-Aug-2021 Gordon Bergling <gbe@FreeBSD.org>

Fix a few typos in source code comments

- s/becase/because/

MFC after: 5 days


# 0ef5eee9 03-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

Add vn_lktype_write()

and remove repetetive code that calculates vnode locking type for write.

Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405


# 7079449b 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

sys/vm: remove several other uses of OBJT_SWAP_TMPFS

Mostly in cases where OBJ_SWAP flag works as well, or by reversing the
condition so that object types can be listed.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 4b8365d7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add OBJT_SWAP_TMPFS pager

This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 2913cc46 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

vm_pageout: Avoid rounding down the inactive scan target

With helper page daemon threads, enabled by default in r364786, we
divide the inactive target by the number of threads, rounding down, and
sum the total number of pages freed by the threads. This sum is
compared with the original target, but by rounding down we might lose
pages, causing the page daemon control loop to conclude that inactive
queue scanning isn't keeping up with demand for free pages. Typically
this results in excessive swapping.

Fix the problem by accounting for the error in the main pagedaemon
thread's target. Note that by default the problem will manifest only in
systems with >16 CPUs in a NUMA domain.

Reviewed by: cem
Discussed with: dougm
Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26610


# 97458520 17-Sep-2020 Mark Johnston <markj@FreeBSD.org>

Increase the default vm.max_user_wired value.

Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit. However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit.

The purpose of the limit is to provide a seatbelt, not to impose some
policy on the use of wired memory. Thus, increase the default limit to
allow reasonable VM configurations to work without tuning.

Reviewed by: kib
Discussed with: dougm
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26424


# 609de97e 28-Aug-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

vm_pageout_scan_active: ensure ps_delta is initialized

Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26212


# 74f5530d 25-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

vm_pageout: Scale worker threads with CPUs

Autoscale vm_pageout worker threads from r364129 with CPU count. The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable.

There will never be less than 1 thread per populated NUMA domain, and
the previous arbitrary upper limit (at most ncpus/2 threads per NUMA
domain) is preserved.

Care is taken to gracefully handle asymmetric NUMA nodes, such as empty
node systems (e.g., AMD 2990WX) and systems with nodes of varying size
(e.g., some larger >20 core Intel Haswell/Broadwell Xeon).

Reviewed by: kib, markj
Sponsored by: Isilon
Differential Revision: https://reviews.freebsd.org/D26152


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# ea7b737a 14-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

vm_pageout: Correct threshold calculation on single-CPU systems

Reported by: Michael Butler
X-MFC-With: r364129


# 0292c54b 11-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

Add support for multithreading the inactive queue pageout within a domain.

In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing. Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon. It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems. For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by: jeff (earlier version)
Discussed with: markj
Tested by: pho
Sponsored by: probably Netflix (based on contemporary commits)
Differential Revision: https://reviews.freebsd.org/D21629


# efec381d 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove most lingering references to the page lock in comments.

Finish updating comments to reflect new locking protocols introduced
over the past year. In particular, vm_page_lock is now effectively
unused.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25868


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 23ed568c 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: remove no longer needed atomic_load_ptr casts


# 3c200db9 10-Feb-2020 Jonathan T. Looney <jtl@FreeBSD.org>

Modify the vm.panic_on_oom sysctl to take a count of events.

Currently, the vm.panic_on_oom sysctl is a boolean which controls the
behavior of the VM system when it encounters an out-of-memory situation.
If set to 0, the VM system kills the largest process. If set to any other
value, the VM system will initiate a panic.

This change makes the sysctl a count of events. If set to 0, the VM system
kills the largest process. If set to any other value, the VM system will
kill the largest process until it has seen the specified number of
out-of-memory events. Once it reaches the specified number of events, it
will initiate a panic.

This change is helpful in capturing cores when the system is in a perpetual
cycle of out-of-memory events (as opposed to just hitting one or two
sporadic out-of-memory events).

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23601


# ace409ce 13-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Restore loop break in vm_pageout_lowmem().

r355004 removed return statement from this loop with intention to also
call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes
infinite loop.

Reviewed by: markj
Sponsored by: iXsystems, Inc.


# f7607c30 02-Jan-2020 Mark Johnston <markj@FreeBSD.org>

Clear queue operation flags when migrating a page to another queue.

The page daemon loops may move pages back to the active queue if
references are detected. In this case we must take care to clear
existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be
set, and that flag is only valid if the page belongs to the inactive
queue.

Also fix a bug in the active queue scan where we were updating "old"
instead of "new". This would only have been hit in rare cases where the
page moved out of the active queue after the beginning of the scan.

Reported by: Bob Prohaska, Idwer Vollering
Tested by: Idwer Vollering
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D23001


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# b7f30bff 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Generalize lazy dequeue logic for wired pages.

Some recent work aims to remove the use of the page lock for
synchronizing updates to page queue state. This change adds a mechanism
to preserve the existing behaviour of lazily dequeuing wired pages,
which was previously synchronized using the page lock.

Handle this by setting PGA_DEQUEUE when a managed page's wire count
transitions from 0 to 1. When the page daemon encounters a page with a
flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that
page, but in so doing it does not modify the page itself and thus racing
with a concurrent free of the page is harmless. The flag is advisory;
the page daemon still checks for wirings after acquiring the object and
page xbusy locks.

vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition.
It must do this before dropping the reference to avoid a use-after-free
but also handles races with concurrent wirings to ensure that
PGA_DEQUEUE is not left unset on a wired page.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22882


# f3f38e25 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Start implementing queue state updates using fcmpset loops.

This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.

Introduce the vm_page_pqstate_commit_*() functions. These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations. vm_page_pqstate_commit() wraps these
functions.

Convert a number of routines to use these new helpers. Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.

Convert the page queue scans to use the new helpers. Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22770
Differential Revision: https://reviews.freebsd.org/D22771
Differential Revision: https://reviews.freebsd.org/D22772
Differential Revision: https://reviews.freebsd.org/D22773
Differential Revision: https://reviews.freebsd.org/D22776


# a8081778 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650


# 9c770a27 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Simplify vm_pageout_init_domain() and add a "big picture" comment.

Stop subtracting 1024/200 from vmd_page_count/200. I cannot see how
such precise accounting can make a difference on modern systems.

Add some explanation of what the page daemon does and how it handles
memory shortages.

Reviewed by: dougm
Discussed with: jeff, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22396


# 8fc25508 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Reclaim memory from UMA if the page daemon is struggling.

Use the UMA reclaim thread to asynchronously drain all caches if
there is a severe shortage in a domain. Otherwise we only trigger UMA
reclamation every 10s even when the system has completely run out of
memory.

Stop entirely draining the caches when one domain falls below its min
threshold. In some workloads it is normal for one NUMA domain to end
up being nearly depleted by kernel memory allocations, for example for
the ZFS ARC. The domainset iterators skip domains below the
vmd_min_free theshold on the first iteration, so we should allow that
mechanism to limit further depletion of the domain's free pages before
taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU).

Discussed with: jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22395


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trival to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchonization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486


# 7cdeaf33 03-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Add preliminary support for atomic updates of per-page queue state.

Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page. Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon. In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks. However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required. vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset(). Use it in the active queue scan, which
does not use the object lock. Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by: kib
Discussed with: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21257


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667


# 245139c6 16-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix OOM handling of some corner cases.

In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals. Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously. The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.

Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13671


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247


# 0cab71bc 06-Jul-2019 Doug Moore <dougm@FreeBSD.org>

Fix style(9) violations involving division by PAGE_SIZE.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847


# d70f0ab3 03-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Cache the next queue element when traversing a page queue.

When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers. vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.

Submitted by: Don Morris <dgmorris@earthlink.net>
Reviewed by: cem, vangyzen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20842


# d842aa51 01-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a vm_page_wired() predicate.

Use it instead of accessing the wire_count field directly. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485


# 54a3a114 13-May-2019 Mark Johnston <markj@FreeBSD.org>

Provide separate accounting for user-wired pages.

Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908


# 64f8d257 02-May-2019 Doug Moore <dougm@FreeBSD.org>

fls() should find the most significant bit of an int faster than a
linear search can, so use it to avoid a linear search in isqrt.

Approved by: kib (mentor), markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20102


# 46e39081 21-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Clear pointers to indicate that the respective locks are released.

This fixes a problem in r344231: vm_pageout_launder() may scan two
queues when swap is disabled.

Reported by: pho
MFC with: r344231


# 60256604 17-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Remove a redundant flag variable.

Use the object pointer itself to determine whether the object is locked.
No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19215


# 8002c3a4 05-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Initialize last_target in the laundry thread control loop.

In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.

CID: 1387997
MFC after: 1 week


# 920239ef 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Fix some problems that manifest when NUMA domain 0 is empty.

- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.

Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704


# 30c5525b 01-Oct-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Allow empty NUMA memory domains to support Threadripper2

The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.

Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683


# 899fe184 23-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Add a per-pagequeue pdpages counter.

Expose these counters under the vm.domain sysctl node. The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by: alc (previous version)
Differential Revision: https://reviews.freebsd.org/D14666


# b50a4ea6 09-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Account for the lowmem handlers in the inactive queue scan target.

Before r329882 the target would be computed after lowmem handlers run
and free pages. On some systems a significant amount of page
reclamation happens this way. However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers. Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by: alc, kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16606


# d7aeb429 15-Jul-2018 Alan Cox <alc@FreeBSD.org>

Test PGA_REFERENCED after calling pmap_ts_referenced(), rather than before,
so that a reference from a concurrently destroyed mapping is observed
during the current scan.

Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16277


# 27e29d10 04-Jun-2018 Mark Johnston <markj@FreeBSD.org>

Correct the description of vm_pageout_scan_inactive() after r334508.

Reported by: alc


# 49a3710c 01-Jun-2018 Mark Johnston <markj@FreeBSD.org>

Remove the "pass" variable from the page daemon control loop.

It serves little purpose after r308474 and r329882. As a side
effect, the removal fixes a bug in r329882 which caused the
page daemon to periodically invoke lowmem handlers even in the
absence of memory pressure.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15491


# 7bb4634e 24-May-2018 Mark Johnston <markj@FreeBSD.org>

Update r334154 with review feedback from D15490.

An old revision was committed by accident.

Differential Revision: https://reviews.freebsd.org/D15490


# be37ee79 24-May-2018 Mark Johnston <markj@FreeBSD.org>

Split the active and inactive queue scans into separate subroutines.

The scans are largely independent, so this helps make the code
marginally neater, and makes it easier to incorporate feedback from the
active queue scan into the page daemon control loop.

Improve some comments while here. No functional change intended.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D15490


# 01f04471 18-May-2018 Mark Johnston <markj@FreeBSD.org>

Don't increment addl_page_shortage for wired pages.

Such pages are dequeued as they're encountered during the inactive queue
scan, so by the time we get to the active queue scan, they should have
already been subtracted from the inactive queue length.

Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15479


# 36f8fe9b 13-May-2018 Mark Johnston <markj@FreeBSD.org>

Get rid of vm_pageout_page_queued().

vm_page_queue(), added in r333256, generalizes vm_pageout_page_queued(),
so use it instead. No functional change intended.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D15402


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893


# 64b38930 19-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Initialize marker pages in vm_page_domain_init().

They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.

Reported and tested by: cperciva
Reviewed by: alc, kib
MFC after: 1 week


# c098768e 02-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Ensure the background laundering threshold is positive after a scan.

The division added in r331732 meant that we wouldn't attempt a
background laundering until at least v_free_target - v_free_min clean
pages had been freed by the page daemon since the last laundering. If
the inactive queue is depleted but not completely empty (e.g., because
it contains busy pages), it can thus take a long time to meet this
threshold. Restore the pre-r331732 behaviour of using a non-zero
background laundering threshold if at least one inactive queue scan has
elapsed since the last attempt at background laundering.

Submitted by: tijl (original version)


# 60684862 29-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Fix the background laundering mechanism after r329882.

Rather than using the number of inactive queue scans as a metric for
how many clean pages are being freed by the page daemon, have the
page daemon keep a running counter of the number of pages it has freed,
and have the laundry thread use that when computing the background
laundering threshold.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D14884


# 30fbfdda 15-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Eliminate pageout wakeup races. Take another step towards lockless
vmd_free_count manipulation. Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup. Only trigger
the pageout daemon on transitions between states. Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14612


# 3b8cf4ac 27-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Give the 0th domain's page daemon thread a consistent name.

Page daemon threads for other domains show up in ps(1) output as
"pagedaemon/domN", so let that be the case for domain 0 as well.

Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14518


# 59d3150b 24-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Restore the pre-r329882 inactive page shortage computation.

With r329882, in the absence of a free page shortage we would only take
len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to
aggressively scan PQ_ACTIVE. Previously we would also include the
number of free pages in this computation, ensuring that we wouldn't scan
PQ_ACTIVE with plenty of free memory available. The change in behaviour
was most noticeable immediately after booting, when PQ_INACTIVE and
PQ_LAUNDRY are nearly empty.

Reviewed by: jeff


# 5f8cd1c0 23-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402


# 2c0f13aa 20-Feb-2018 Konstantin Belousov <kib@FreeBSD.org>

vm_wait() rework.

Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass. If there is no object
associated with the wait, use curthread' policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Scetched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# b6715dab 13-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.

Sponsored by: Netflix, Dell/EMC Isilon
Discussed with: jhb


# 5c515efc 29-Dec-2017 Alan Cox <alc@FreeBSD.org>

After r327168, the variable "vm_pageout_wanted" can be static.

MFC after: 2 weeks


# 65ef3231 26-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Ensure that pass > 0 when starting a scan with vm_pages_needed == 1.

Otherwise the page daemon will not reclaim pages and thus will not
wake threads sleeping in VM_WAIT.

Reported and tested by: pho
Reviewed by: alc, kib
X-MFC with: r327168
Differential Revision: https://reviews.freebsd.org/D13640


# 280d15cd 24-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Fix two problems with the page daemon control loop.

Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.

1) After completing an inactive queue scan, concurrent allocations may
have prevented the page daemon from meeting the v_free_min threshold.
In this case, the page daemon was going to sleep even when the
inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
This can lead to a lost wakeup if a call occurs after the page daemon
clears vm_pageout_wanted but before going to sleep.

Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.

Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.

Reported by: jeff
Reviewed by: alc
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13424


# cb35676e 11-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Use a dedicated counter for inactive queue scans.

The laundry thread keeps track of the number of inactive queue scans
performed by the page daemon, and was previously using the v_pdwakeups
counter to count them. However, in some cases the inactive queue may
be scanned multiple times after a single wakeup, so it's more accurate
to use a dedicated counter.

Reviewed by: alc, kib (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13422


# 82e2d06a 09-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Fix the act_scan_laundry_weight mechanism.

r292392 modified the active queue scan to weigh clean pages differently
from dirty pages when attempting to meet the inactive queue target. When
r306706 was merged into the PQ_LAUNDRY branch, this mechanism was
broken. Fix it by scalaing the correct page shortage variable.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13423


# 6eebec83 06-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Use unique wait messages in the page daemon control loop.

Discussed with: alc
MFC after: 1 week


# 796df753 30-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

SPDX: Consider code from Carnegie-Mellon University.

Interesting cases, most likely from CMU Mach sources.


# 57cd81a3 29-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Verify the object/vnode association after vget() in vm_pageout_clean().

It's theoretically possible for the vnode and object to be disassociated
while locks are dropped around the vget() call, in which case we
shouldn't proceed with laundering.

Noted and reviewed by: kib
MFC after: 1 week


# df57947f 18-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

spdx: initial adoption of licensing ID tags.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133


# e0b2fc3a 08-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Allow various page daemon parameters to be set from loader.conf.

MFC after: 1 week


# ac04195b 20-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Move swapout code into vm/vm_swapout.c.

There is no NO_SWAPPING #ifdef left in the code.

Requested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12663


# aed9aaaa 28-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Synchronize page laundering with pmap_extract_and_hold().

Before r207410, the hold count of a page in a page queue was protected
by the queue lock, and, before laundering a page, the page daemon
removed managed writeable mappings of the page before releasing the
queue lock. This ensured that other threads could not concurrently
create transient writeable mappings using pmap_extract_and_hold() on a
user map, as is done for example by vmapbuf(). With that revision,
however, a race can allow the creation of such a mapping, meaning that
the page might be modified as it is being laundered, potentially
resulting in it being marked clean when its contents do not match
those given to the pager. Close the race by using the page lock to
synchronize the hold count check in vm_pageout_cluster() with the
removal of writeable managed mappings.

Reported by: alc
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12084


# e2241590 24-Jun-2017 Alan Cox <alc@FreeBSD.org>

Increase the pageout cluster size to 32 pages.

Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the size of a hash table entry at its current
size.

Eliminate a pointless macro.

Reviewed by: kib, markj (an earlier version)
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D11305


# 3e78e983 05-Jun-2017 Alan Cox <alc@FreeBSD.org>

The variable "breakout" is used like a Boolean, so actually define it as
one.

Reviewed by: kib
MFC after: 5 days


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# 9b43bc27 25-Feb-2017 Andriy Gapon <avg@FreeBSD.org>

call vm_lowmem hook in uma_reclaim_worker

A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook. No handler makes use of the flags yet.

Reviewed by: markj, kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9764


# 937c1b07 14-Feb-2017 Andriy Gapon <avg@FreeBSD.org>

try to fix RACCT_RSS accounting

There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero. In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state. Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR: 210315
Reviewed by: trasz
Differential Revision: https://reviews.freebsd.org/D9464


# b1fd102e 02-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Add a page queue for holding dirty anonymous unswappable pages.

On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by: alc


# 99e6e193 23-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Release laundered vnode pages to the head of the inactive queue.

The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589


# ebcddc72 09-Nov-2016 Alan Cox <alc@FreeBSD.org>

Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302


# 28323add 08-Nov-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Fix improper use of "its".

Sponsored by: Dell EMC Isilon


# 70cf3ced 05-Oct-2016 Alan Cox <alc@FreeBSD.org>

Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker(). There is no reason
to store the pass in the NUMA domain structure.

Reviewed by: kib
MFC after: 3 weeks


# e57dd910 05-Oct-2016 Alan Cox <alc@FreeBSD.org>

Change vm_pageout_scan() to return a value indicating whether the free page
target was met.

Previously, vm_pageout_worker() itself checked the length of the free page
queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan
freed enough pages to meet the free page target. Specifically,
vm_pageout_worker() used vm_paging_needed(). The trouble with
vm_paging_needed() is that it compares the length of the free page queues to
the wakeup threshold for the page daemon, which is much lower than the free
page target. Consequently, vm_pageout_worker() could conclude that the
inactive queue scan succeeded in meeting its free page target when in fact
it did not; and rather than immediately triggering an all-out laundering
pass over the inactive queue, vm_pageout_worker() would go back to sleep
waiting for the free page count to fall below the page daemon wakeup
threshold again, at which point it will perform another limited (pass == 1)
scan over the inactive queue.

Changing vm_pageout_worker() to use vm_page_count_target() instead of
vm_paging_needed() won't work because any page allocations that happen
concurrently with the inactive queue scan will result in the free page count
being below the target at the end of a successful scan. Instead, having
vm_pageout_scan() return a value indicating success or failure is the most
straightforward fix.

Reviewed by: kib, markj
MFC after: 3 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8111


# 79144408 11-Aug-2016 Alan Cox <alc@FreeBSD.org>

Correct errors and clean up the comments on the active queue scan.

Eliminate some unnecessary blank lines.

Reviewed by: kib, markj
MFC after: 1 week


# f0edf3f8 05-Aug-2016 Alan Cox <alc@FreeBSD.org>

Correct a spelling error.


# 248fe642 04-Aug-2016 Alan Cox <alc@FreeBSD.org>

Clean up the comments and code style in and around vm_pageout_cluster().
In particular, fix factual, grammatical, and spelling errors in various
comments, and remove comments that are out of place in this function.

Reviewed by: kib, markj
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7410


# 87ff568c 01-Aug-2016 Alan Cox <alc@FreeBSD.org>

Restore the historical behavior of "sysctl vm.swap_idle_enabled=1". Prior
to r254304, we had separate functions for reclamation and laundering
(vm_pageout_scan) versus updating usage information, i.e., "reference
bits", on active pages (vm_pageout_page_stats), and we only performed
vm_req_vmdaemon(VM_SWAP_IDLE) if vm_pages_needed was true. However, since
r254303, if vm_swap_idle_enabled was "1", we have performed
vm_req_vmdaemon(VM_SWAP_IDLE) regardless of whether we are short of free
pages. This was unintended and too aggressive, so I suspect no one uses
this feature. With this change, we restore the historical behavior and
only perform vm_req_vmdaemon(VM_SWAP_IDLE) when we are short of free
pages.

Reviewed by: kib, markj


# 793172ea 29-Jul-2016 Alan Cox <alc@FreeBSD.org>

Remove a probe declaration that has been unused since r292469, when
vm_pageout_grow_cache() was replaced.

MFC after: 3 days


# f095d1bb 28-Jul-2016 Alan Cox <alc@FreeBSD.org>

Remove any mention of cache (PG_CACHE) pages from the comments in
vm_pageout_scan(). That function has not cached pages since r284376.

MFC after: 3 days


# 3ac8f842 27-Jul-2016 Mark Johnston <markj@FreeBSD.org>

De-pluralize "queues" where appropriate in the pagedaemon code.

MFC after: 1 week


# a766ffd0 26-Jul-2016 Alan Cox <alc@FreeBSD.org>

Update a comment to reflect r284376.

MFC after: 3 days


# 44be0a8e 23-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Correct a comment - each page queue has its own lock.

Reviewed by: alc
MFC after: 3 days


# 20c58db9 19-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Make vm_pageout_wakeup_thresh a u_int rather than an int.

It's a threshold for v_free_count, which is of type u_int. This also lets
us get rid of a cast in vm_paging_needed().

Reviewed by: alc
MFC after: 1 week


# 95e2409a 22-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix a LOR between vnode locks and allproc_lock.

There is an order between covered vnode lock and allproc_lock, which
is established by calling mountcheckdirs() while owning the covered
vnode lock. mountcheckdirs() iterates over the processes, protected by
allproc_lock. This order is needed and seems to be not avoidable.

On the other hand, various VM daemons also need to iterate over all
processes, and they lock and unlock user maps. Since unlock of the
user map may trigger processing of the deferred map entries, it causes
vnode locking to occur. Or, when vmspace is freed, dropping references
on the vnode-backed object also lock vnodes. We get reverted order
comparing with the mount/unmount order.

For VM daemons, there is no need to own allproc_lock while we operate
on vmspaces. If the process is held, it serves as the marker for
allproc list, which allows to continue the iteration.

Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in
of the kernel stacks. It is used instead of _PHOLD() in vm code,
since e.g. calling faultin() in OOM conditions only exaggerates the
problem.

Modernize comment describing PHOLD.

Reported by: lists@yamagi.org
Tested by: pho (previous version)
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 week
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6679


# 56ce0690 27-May-2016 Alan Cox <alc@FreeBSD.org>

The flag "vm_pages_needed" has long served two distinct purposes: (1) to
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wakeup threads waiting for free
pages versus when the page daemon has completed its work. In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.

Discussed with: jeff
Reviewed by: kib, markj
Tested by: markj
Sponsored by: EMC / Isilon Storage Division


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# 62d70a81 09-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Add more fine-grained kernel options for NUMA support.

VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782


# c869e672 19-Dec-2015 Alan Cox <alc@FreeBSD.org>

Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
memory allocator's free page lists. This replaces the long-standing
approach of scanning the inactive and inactive queues, converting clean
pages into PG_CACHED pages and laundering dirty pages. In contrast, the
new mechanism does not use PG_CACHED pages nor does it trigger a large
number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
free pages in the physical memory allocator's free page lists that are
covered by the direct map. Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places. For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by: kib, markj
Glanced at: jhb
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4444


# 2eb2f0d5 27-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

In vm_pageout_grow_cache(), do not re-try the inactive queue when
active queue scan initiated write.

Re-trying from the inactive queue when doing active scan makes the
loop never end if number of domains is greater than 1 and inactive or
active scan cannot reach the target.

Reported and tested by: Andrew Gallatin <gallatin@netflix.com>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 7672ca05 21-Nov-2015 Mark Johnston <markj@FreeBSD.org>

Remove unneeded includes of opt_kdtrace.h.

As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h
is not needed when defining SDT(9) probes.


# 76386c7e 15-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Rework the test which raises OOM condition. Right now, the code
checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes. This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately. E.g.,
on the swap-less systems, large active queue easily triggered OOM.

Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes. Track the failed passes per
pagedaemon thread.

The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations. If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised. Each pass takes 1/2 of seconds. Less the
value, more sensible the pagedaemon is to the page shortage.

In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load. But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed. Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning. Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.

Reported by: Dmitry Sivachenko, bde, many others
Tested by: pho, bde, tuexen (arm)
Reviewed by: alc
Discussed with: bde, imp
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 3949873f 15-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not use vmspace_resident_count() for the OOM process selection.
Residency count track the number of pte entries installed into the
current pmap, which does not reflect the consumption of the physical
memory by the address map. Due to several mechanisms like pv entries
reclamation, copy on write etc. the resident pte entries count may be
much less than the amount of physical memory kept by the process.

Provide the OOM-specific vm_pageout_oom_pagecount() function which
estimates the amount of reclamaible memory which could be stolen if
the process is killed.

Reported and tested by: pho
Reviewed by: alc
Comments text by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# b98acc0a 15-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

VM daemon works in parallel with the pagedaemon threads, and, among
other actions, swaps out kernel stacks of the processes. On the other
hand, currentl OOM logic which selects a process to kill in the
critical condition, skips process with swapped-out thread. Under some
loads, this results in the big(gest) process being ignored by OOM.

Do not skip a process which has inhibited thread due to the swap-out,
in the OOM selection loop. Note that killing such process requires
the thread stack page-in, but sometimes this is the only way to
recover.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 7e78597f 07-Nov-2015 Mark Johnston <markj@FreeBSD.org>

Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon. Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.

Reviewed by: alc
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division


# 69b8585e 18-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Only marker is guaranteed to be present on the queue after the relock
in vm_pageout_fallback_object_lock() and vm_pageout_page_lock(). The
check for the m->queue == queue assumes that the page does belong to a
queue.

Modify the 'unchanged' calculation bu dereferencing the marker tailq
pointers, which is known to belong to the queue. Since for a page m
linked to the queue, m->queue must be equal to the queue index, assert
this instead of checking.

In collaboration with: alc
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 2 weeks


# 8748f58c 15-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Revert r289302, invalid pages can be queued, e.g. by vfs_vmio_unwire().

Found by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 12a73f20 14-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Invalid pages should not appear on the inactive queue. Change the
check into an assertion.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# bc727596 03-Oct-2015 Alan Cox <alc@FreeBSD.org>

Reduce the scope of a variable to the only file where it is used.


# d9347bca 20-Sep-2015 Alan Cox <alc@FreeBSD.org>

Correct a non-fatal error in vm_pageout_worker(). vm_pageout_worker()
should not assume that vm_pages_needed will remain set while it sleeps.
Other threads can clear vm_pages_needed by performing a sufficient
number of vm_page_free() calls, e.g., process termination. The effect
of this error was that vm_pageout_worker() would free and/or launder
pages when, in fact, there was no shortage of free pages.

Rewrite a nearby comment to describe all of the possible cases and not
just the most common case. The problem being that the comment made
the most common case seem like the only case.

Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# 27a9fb2f 07-Sep-2015 Alan Cox <alc@FreeBSD.org>

To simplify upcoming changes to the inactive queue scan, change the code
so that there is only one place where pages are freed and only one place
where pages are moved to the tail of the queue.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 960810cc 05-Sep-2015 Alan Cox <alc@FreeBSD.org>

Eliminate pointless requeueing of pages from terminated objects. These
pages will have left the inactive queue before the page daemon performs
its next scan. Also, ignore references to pages from terminated objects.
This allows the clean pages to be freed a little sooner.

Move some comments to their proper place, i.e., next to the code that
they describe, and update other nearby comments.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# a3aeedab 01-Sep-2015 Alan Cox <alc@FreeBSD.org>

Handle held pages earlier in the inactive queue scan.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 40aa80a7 27-Aug-2015 Alan Cox <alc@FreeBSD.org>

In vm_pageout_scan(), simplify the logic for determining if a page can be
paged out and apply some nearby style fixes.

In collaboration with: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation, EMC / Isilon Storage Division


# eb5d3969 24-Aug-2015 Alan Cox <alc@FreeBSD.org>

Testing whether a page is dirty does not require the page lock. Moreover,
it may involve a pmap operation that iterates over the page's PV list, so
unnecessarily holding the page lock is undesirable.

MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# a6bf3a9e 20-Aug-2015 Ryan Stone <rstone@FreeBSD.org>

Prevent ticks rollover from preventing vm_lowmem event

Currently vm_pageout_scan() uses a ticks-based scheme to rate-limit
the number of times that the vm_lowmem event will happen. However
if no events happen for long enough for ticks to roll over, this
leaves us in a long window in which vm_lowmem events will not
happen.

Replace the use of ticks with time_t to prevent rollover from ever
being an issue.

Reviewed by: ian
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3439


# f9b11500 16-Aug-2015 Alan Cox <alc@FreeBSD.org>

As another piece of PG_CACHE page elimination, remove an LRU-defeating call
to vm_page_try_to_cache() from vm_pageout_flush(). Other changes, most
recently r286814, have made this call unnecessary.

Reviewed by: kib
Discussed with: jeff
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 22cf98d1 08-Jul-2015 Alan Cox <alc@FreeBSD.org>

The intention of r254304 was to scan the active queue continuously.
However, I've observed the active queue scan stopping when there are
frequent free page shortages and the inactive queue is steadily refilled
by other mechanisms, such as the sequential access heuristic in vm_fault()
or madvise(2). To remedy this problem, record the time of the last active
queue scan, and always scan a number of pages proportional to the time
since the last scan, regardless of whether that last scan was a
timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass >
0") scan.

Also, on a timeout-triggered scan, allow a full scan of the active queue
when the system is short of inactive pages.

Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# aa044135 20-Jun-2015 Alan Cox <alc@FreeBSD.org>

Avoid pmap_is_modified() on pages that can't be mapped.

MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# 776f729c 14-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Invalid pages do not need neither update of the activation count nor
they coould be dirty. Move the handling if the invalid pages in the
inactive scan earlier.

Remove some code duplication in the scan by introducing the
'drop_page' label, which centralizes the object and the page unlock.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 78afdce6 13-Jun-2015 Alan Cox <alc@FreeBSD.org>

As the next step in eliminating PG_CACHE pages, free rather than cache
pages in vm_pageout_scan(). The reactivation rate of cache pages created
by vm_pageout_scan() is extremely low; typically no more than 0.5% to
2.25% of the pages are ever reactivated. At the same time, caching pages
is more expensive than freeing them. For example, in a test with
PostgreSQL, this change reduced the amount of time spent in the inactive
queue scan by 1/6.

Differential Revision: https://reviews.freebsd.org/D2805
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 44ec2b63 09-May-2015 Konstantin Belousov <kib@FreeBSD.org>

The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon. The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still. The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly. The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 34d8b7ea 06-Apr-2015 Jeff Roberson <jeff@FreeBSD.org>

- Simplify vm_pageout_scan() by introducing a new vm_pageout_clean()
function that does the locking and validation associated with cleaning
a page. This moves 150 lines of code into its own function.
- Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
really does; clustering nearby pages for pageout optimization.

Reviewd by: alc, kib, kmacy
Tested by: pho (earlier version)
Sponsored by: EMC / Isilon


# b3de46ab 27-Mar-2015 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate pagequeue locking in the dirty code in vm_pageout_scan().
- Use a more precise series of tests to see if the page changed while we
were locking the vnode.

Reviewed by: alc
Sponsored by: EMC / Isilon


# 8311a2b8 24-Jan-2015 Will Andrews <will@FreeBSD.org>

Add vm.panic_on_oom sysctl, which enables those who would rather panic than
kill a process, when the system runs out of memory. Defaults to off.

Usually, this is most useful when the OOM condition is due to mismanagement
of memory, on a system where the applications in question don't respond well
to being killed.

In theory, if the system is properly managed, it shouldn't be possible to
hit this condition. If it does, the panic can be more desirable for some
users (since it can be a good means of finding the root cause) rather than
killing the largest process and continuing on its merry way.

As kib@ mentions in the differential, there is also protect(1), which uses
procctl(PROC_SPROTECT) to ensure that some processes are immune. However,
a panic approach is still useful in some environments. This is primarily
intended as a development/debugging tool.

Differential Revision: D1627
Reviewed by: kib
MFC after: 1 week


# 71943c3d 24-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Avoid calling vmspace_free() while owning the process lock. Freeing
of an vm space may require obtaining sleepable locks. Hold the
process to keep the pointer valid, and change trylock to lock, since
there is no longer two process locks owned simultaneously in
vm_pageout_oom().

Note that after the process lock is dropped, process might exec, and
no longer qualify as the owner of biggest vm space.

In collaboration with: rstone
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 14a0d74e 03-Oct-2014 Steven Hartland <smh@FreeBSD.org>

Refactor ZFS ARC reclaim checks and limits

Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay


# f721133e 24-Sep-2014 Steven Hartland <smh@FreeBSD.org>

Fix ticks wrap issue of lowmem test in vm_pageout_scan

Reviewed by: jhb (D818)
MFC after: 3 days
Sponsored by: Multiplay


# 4d19f4ad 28-Aug-2014 Steven Hartland <smh@FreeBSD.org>

Refactor ZFS ARC reclaim logic to be more VM cooperative

Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay


# 9452b5ed 26-Aug-2014 Alan Cox <alc@FreeBSD.org>

Back in the days when the kernel was single threaded, testing
"vm_paging_target() > 0" was a reasonable way of determining if the
inactive queue scan met its target. However, now that other threads
can be allocating pages while the inactive queue scan is running, it's
an unreliable method. The effect of it being unreliable is that we
can start swapping out processes when we didn't intend to.

This issue has existed since the kernel was multithreaded, but the
changes to the inactive queue target in 10.0-RELEASE have made its
effects visible.

This change introduces a more direct method for determining if the
inactive queue scan met its target that is not affected by the actions
of other threads.

Reported by: Steve Polyack
Tested by: pho, Steve Polyack (an earlier version)
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# e7d939bd 06-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# 60196cda 05-May-2014 Alan Cox <alc@FreeBSD.org>

Prior to r254304, a separate function, vm_pageout_page_stats(), was used to
periodically update the reference status of the active pages. This function
was called, instead of vm_pageout_scan(), when memory was not scarce. The
objective was to provide up to date reference status for active pages in
case memory did become scarce and active pages needed to be deactivated.

The active page queue scan performed by vm_pageout_page_stats() was
virtually identical to that performed by vm_pageout_scan(), and so r254304
eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is
called with the parameter "pass" set to zero. The intention was that when
pass is zero, vm_pageout_scan() would only scan the active queue. However,
the variable page_shortage can still be greater than zero when memory is not
scarce and vm_pageout_scan() is called with pass equal to zero.
Consequently, the inactive queue may be scanned and dirty pages laundered
even though that was not intended by r254304. This revision fixes that.

Reported by: avg
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# ab46f63e 20-Jan-2014 John Baldwin <jhb@FreeBSD.org>

Fix a couple of typos.


# 86fa2471 18-Jan-2014 Alan Cox <alc@FreeBSD.org>

Style changes in vm_pageout_scan():

1. Be consistent in the style of "act_delta" manipulations between the
inactive and active queue scans.

2. Explicitly compare to zero.

3. The deactivation of a page is based is based on its recent history
and not just the current call to vm_pageout_scan(). The variable
"act_delta" represents the current state of the page, and not its
history. Avoid possible confusion by not (ab)using "act_delta" for
the making the deactivation decision.

Submitted by: kib [1]
Reviewed by: kib [2,3]


# 9099545a 12-Jan-2014 Alan Cox <alc@FreeBSD.org>

Correctly update the count of stuck pages, "addl_page_shortage", in
vm_pageout_scan(). There were missing increments in two less common cases.

Don't conflate the count of stuck pages and the pageout deficit provided by
vm_page_alloc{,_contig}(). (A proposed fix to the OOM code depends on this.)

Handle held pages consistently in the inactive queue scan. In the more
common case, we did not move the page to the tail of the queue. Whereas, in
the less common case, we did. There's no particular reason to move the page
in the less common case, so remove it.

Perform the calculation of the page shortage for the active queue scan a
little earlier, before the active queue lock is acquired. The correctness
of this calculation doesn't depend on the active queue lock being held.

Eliminate a redundant variable, "pcount". Use the more descriptive
variable, "maxscan", in its place.

Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess
parentheses.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 938b0f5b 25-Dec-2013 Marcel Moolenaar <marcel@FreeBSD.org>

For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is
that we don't have a good way (yet) to iterate over the mapped pages by
virtual address and simply try each page within the range. Given that we
call pmap_remove() over the entire 2^63 bytes of address space, it takes
a while for pmap_remove to have tried all 2^50 pages.
By using pmap_remove_pages() we use the PV list to find all mappings.

Change derived from a patch by: alc


# d395270d 25-Dec-2013 Dimitry Andric <dim@FreeBSD.org>

In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as
argument, cast the incoming 0 argument to void *, to silence a warning
from clang 3.4 ("expression which evaluates to zero treated as a null
pointer constant of type 'void *' [-Wnon-literal-null-conversion]").

MFC after: 3 days


# 1bd7d0b7 09-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

If filesystem declares that it supports shared locking for writes, use
shared vnode lock for VOP_PUTPAGES() as well. The only such
filesystem in the tree is ZFS, and it uses
vnode_pager_generic_putpages(), which performs the pageout with
VOP_WRITE().

Reviewed by: alc
Discussed with: avg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 274132ac 21-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate the vm object lock from the active queue scan. It is not
necessary since we do not free or cache the page from active anymore.
Document the one possible race that is harmless.

Sponsored by: EMC / Isilon Storage Division
Discussed with: alc


# c9612b2d 19-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

- Increase the active lru refresh interval to 10 minutes. This has been
shown to negatively impact some workloads and the goal is only to
eliminate worst case behaviors for very long periods of paging
inactivity. Eventually we should determine a more complex scaling
factor for this feature.
- Rate limit low memory callback handlers to limit thrashing. Set the
default to 10 seconds.

Sponsored by: EMC / Isilon Storage Division


# 949c9186 17-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the arbitrary binding of the pagedaemon threads to the domains,
update the comment accordingly and make it more precise.

Requested and reviewed by: jeff (previous version)


# 114f62c6 15-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

- Fix bug in r254304. Use the ACTIVE pq count for the active list
processing, not inactive. This was the result of a bad merge.

Reported by: pho
Sponsored by: EMC / Isilon Storage Division


# d9e23210 13-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wakeup before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period pages every second. This causes us to visit
the whole active list every 60 seconds. Previously we would only maintain
the active LRU when we were short on pages which would mean it could be
woefully out of date.

Reviewed by: alc (slight variant of this)
Discussed with: alc, kib, jhb
Sponsored by: EMC / Isilon Storage Division


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 449c2e92 07-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# bb7858ea 26-Jul-2013 Jeff Roberson <jeff@FreeBSD.org>

Improve page LRU quality and simplify the logic.

- Don't short-circuit aging tests for unmapped objects. This biases
against unmapped file pages and transient mappings.
- Always honor PGA_REFERENCED. We can now use this after soft busying
to lazily restart the LRU.
- Don't transition directly from active to cached bypassing the inactive
queue. This frees recently used data much too early.
- Rename actcount to act_delta to be more consistent with use and meaning.

Reviewed by: kib, alc
Sponsored by: EMC / Isilon Storage Division


# 90776bd7 23-Jul-2013 Jeff Roberson <jeff@FreeBSD.org>

- Remove the long obsolete 'vm_pageout_algorithm' experiment.

Discussed with: alc
Sponsored by: EMC / Isilon Storage Division


# e23b0a19 03-Jun-2013 Alan Cox <alc@FreeBSD.org>

Relax the object locking in vm_pageout_map_deactivate_pages() and
vm_pageout_object_deactivate_pages(). A read lock suffices.

Sponsored by: EMC / Isilon Storage Division


# a15f7df5 08-Apr-2013 Attilio Rao <attilio@FreeBSD.org>

The per-page act_count can be made very-easily protected by the
per-page lock rather than vm_object lock, without any further overhead.
Make the formal switch.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 53636869 27-Jan-2013 Andrey Zonov <zont@FreeBSD.org>

- Add sysctls to show number of stats scans.

MFC after: 2 weeks


# 4a365329 27-Jan-2013 Andrey Zonov <zont@FreeBSD.org>

- Style.

MFC after: 2 weeks


# 28634820 08-Dec-2012 Alan Cox <alc@FreeBSD.org>

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by: kib


# 8d220203 12-Nov-2012 Alan Cox <alc@FreeBSD.org>

Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by: kib


# 9fc4739d 01-Nov-2012 Alan Cox <alc@FreeBSD.org>

In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings. So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after: 6 weeks


# a406d8c3 28-Oct-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove useless check; vm_pindex_t is unsigned on all architectures.

CID: 3701
Found with: Coverity Prevent


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 7ecfabc7 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Move vm_page_requeue() to the only file that uses it.

MFC after: 3 weeks


# 9af47af6 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Eliminate the conditional for releasing the page queues lock in
vm_page_sleep(). vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held. These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after: 3 weeks


# 1f11f2bf 23-Sep-2012 Alan Cox <alc@FreeBSD.org>

Address a race condition that was introduced in r238212. Unless the page
queues lock is acquired before the page lock is released, there is no
guarantee that the page will still be in that same page queue when
vm_page_requeue() is called.

Reported by: pho
In collaboration with: kib
MFC after: 3 days


# 96240c89 14-Sep-2012 Eitan Adler <eadler@FreeBSD.org>

Correct double "the the"

Approved by: cperciva
MFC after: 3 days


# 18f55b01 06-Aug-2012 Alan Cox <alc@FreeBSD.org>

Never sleep on busy pages in vm_pageout_launder(), always skip them. Long
ago, sleeping on busy pages in vm_pageout_launder() made sense. The call
to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages
blocked vm_pageout_launder() until the flush had completed. However, in
CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was
changed to request synchronous I/O, but the sleep on busy pages was not
removed.


# 311e34e2 26-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not requeue held page or page for which locking failed, just leave
them alone.

Process the act_count updates for the held pages in the vm_pageout
loop over the inactive queue, instead of refusing to do anything with
such page.

Clarify the intent of the addl_page_shortage counter and change its
use for pages which are not processed in the loop according to the
description.

Reviewed by: alc
MFC after: 2 weeks


# 2cba1ccd 23-Jul-2012 Alan Cox <alc@FreeBSD.org>

Addendum to r238604. If the inactive queue scan isn't restarted, then
the variable "addl_page_shortage_init" isn't needed.

X-MFC after: r238604


# d4961bcb 18-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not restart scan of the inactive queue when non-inactive page is
found. Rather, we shall not find such pages on inactive queue at all.

Requested and reviewed by: alc
MFC after: 2 weeks


# 85eeca35 17-Jul-2012 Alan Cox <alc@FreeBSD.org>

Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar
code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache().

Reviewed by: kib


# a4156419 08-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Avoid vm page queues lock leak after r238212.

Reported and tested by: Michael Butler <imb protected-networks net>
Reviewed by: alc
Pointy hat to: kib
MFC after: 20 days


# 48cc2fc77 07-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Drop page queues mutex on each iteration of vm_pageout_scan over the
inactive queue, unless busy page is found.

Dropping the mutex often should allow the other lock acquires to
proceed without waiting for whole inactive scan to finish. On machines
with lot of physical memory scan often need to iterate a lot before it
finishes or finds a page which requires laundring, causing high
latency for other lock waiters.

Suggested and reviewed by: alc
MFC after: 3 weeks


# 5d10ef20 06-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: alc (previous version)
MFC after: 1 week


# 6031c68d 16-Jun-2012 Alan Cox <alc@FreeBSD.org>

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


# 7900f95d 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Assert that fictitious or unmanaged pages do not appear on
active/inactive lists.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 126d6082 17-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


# 8d01a3b2 16-Jan-2012 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Revert r212360 now that PowerPC can handle large sparse arguments to
pmap_remove() (changed in r228412).

MFC after: 2 weeks


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# afcc55f3 06-Jul-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# a8229fa3 02-Jul-2011 Alan Cox <alc@FreeBSD.org>

Initialize marker pages as held rather than fictitious/wired. Marking the
page as held is more useful as a safety precaution in case someone forgets
to check for PG_MARKER.

Reviewed by: kib


# f497cda2 06-Apr-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

In vm_daemon(), do not skip processes stopped with SIGSTOP.


# 099e7e95 06-Apr-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add RACCT_RSS.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 8e6fa660 24-Mar-2011 John Baldwin <jhb@FreeBSD.org>

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


# 3fccbe43 18-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

In vm_daemon(), when iterating over all processes in the system, skip those
which are not yet fully initialized (i.e. ones with p_state == PRS_NEW).
Without it, we could panic in _thread_lock_flags().

Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that
require similar fix.

Reported by: pho, keramida
Discussed with: kib


# 4c6a2e7a 16-Jan-2011 Alan Cox <alc@FreeBSD.org>

Shift responsibility for synchronizing access to the page's act_count
field to the object's lock.

Reviewed by: kib@


# 17f6a17b 02-Jan-2011 Alan Cox <alc@FreeBSD.org>

Release the page lock early in vm_pageout_clean(). There is no reason to
hold this lock until the end of the function.

With the aforementioned change to vm_pageout_clean(), page locks don't need
to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.

Reviewed by: kib


# 1e8a675c 18-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by: avg
Reviewed by: alc
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 42768fec 09-Sep-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

On architectures with non-tree-based page tables like PowerPC, every page
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.

MFC after: 6 weeks


# cdaba1f2 02-Jul-2010 Alan Cox <alc@FreeBSD.org>

Push down the acquisition of the page queues lock into
vm_pageout_page_stats(). In particular, avoid acquiring the page
queues lock unless iterating over the active queue.


# 9cf51988 02-Jul-2010 Alan Cox <alc@FreeBSD.org>

With the demise of page coloring, the page queue macros no longer serve any
useful purpose. Eliminate them.

Reviewed by: kib


# 95976f3f 30-Jun-2010 Alan Cox <alc@FreeBSD.org>

Simplify entry to vm_pageout_clean(). Expect the page to be locked.
Previously, the caller unlocked the page, and vm_pageout_clean()
immediately reacquired the page lock. Also, assert rather than test
that the page is neither busy nor held. Since vm_pageout_clean() is
called with the object and page locked, the page can't have changed
state since the caller verified that the page is neither busy nor
held.


# 91b4f427 21-Jun-2010 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean(). When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by: kib@
MFC after: 3 weeks


# 9ee2165f 14-Jun-2010 Alan Cox <alc@FreeBSD.org>

Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats(). These checks were recently introduced by
the first page locking commit, r207410, but they are not needed. At
the same time, eliminate some redundant accesses to the page's object
field. (These accesses should have neen eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter. Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one. Add comments to vm_page_wire() and
vm_page_unwire() documenting this. Add assertions to these functions
as well.

Update the comment describing vm_page_unwire(). Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate(). Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out. Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues. Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues. Now, we
will never acquire the page queues lock for this case. (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic with vm_page_unwire() to provide a more precise message.

Reviewed by: kib@


# ce186587 10-Jun-2010 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines. Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked. Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick(). (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page. Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete. Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used. Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by: kib@


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 04b7b96b 13-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206814 (by alc):
Remove a nonsensical test from vm_pageout_clean(). A page can't be in the
inactive queue and have a non-zero wire count.


# 3c4a2440 08-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# af394cfa 07-May-2010 Jung-uk Kim <jkim@FreeBSD.org>

Fix a typo in the previous commit.


# af4b86b9 07-May-2010 Konstantin Belousov <kib@FreeBSD.org>

One more use for vm_pageout_init_marker().

Reviewed by: alc


# 8c616246 05-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Add a helper function vm_pageout_page_lock(), similar to tegge'
vm_pageout_fallback_object_lock(), to obtain the page lock
while having page queue lock locked, and still maintain the
page position in a queue.

Use the helper to lock the page in the pageout daemon and contig launder
iterators instead of skipping the page if its lock is contested.
Skipping locked pages easily causes pagedaemon or launder to not make a
progress with page cleaning.

Proposed and reviewed by: alc


# 88389d61 03-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206264:
When OOM searches for a process to kill, ignore the processes already
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.


# 6c56db5c 02-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate an assignment that was made redundant by r207410.


# 447fe2a4 02-May-2010 Alan Cox <alc@FreeBSD.org>

Defer the acquisition of the page and page queues locks in
vm_pageout_object_deactivate_pages().


# 7bec141b 30-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

push up dropping of the page queue lock to avoid holding it in vm_pageout_flush


# e8f26319 30-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

- don't check hold_count without the page lock held
- don't leak the page lock if m->object is NULL
(assuming that that check will in fact even be valid when m->object is protected by the page lock)


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 82bfb965 29-Apr-2010 Alan Cox <alc@FreeBSD.org>

Simplify the inner loop of vm_pageout_object_deactivate_pages(). Rather
than checking each page for PG_UNMANAGED, check the vm object's type.
Only OBJT_PHYS can have unmanaged pages. Eliminate a pointless counter.
The vm object is locked, that lock is never released by the inner loop,
and the set of pages contained by the vm object is not changed by the
inner loop. Therefore, the counter serves no purpose.


# 5d4a7b79 19-Apr-2010 Alan Cox <alc@FreeBSD.org>

Eliminate an unnecessary call to pmap_remove_all(). If a page belongs to
an object whose reference count is zero, then that page cannot possibly
be mapped.


# b28889a2 18-Apr-2010 Alan Cox <alc@FreeBSD.org>

Remove a nonsensical test from vm_pageout_clean(). A page can't be in the
inactive queue and have a non-zero wire count.

Reviewed by: kib
MFC after: 3 weeks


# 3f1c4c4f 06-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

When OOM searches for a process to kill, ignore the processes already
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.

This removes the often encountered deadlock, where OOM continously
selects the same victim process, that sleeps uninterruptibly waiting
for a page. The killed process may still sleep if page cannot be
obtained immediately, but testing has shown that system has much
higher chance to survive in OOM situation with the patch.

In collaboration with: pho
Reviewed by: alc
MFC after: 4 weeks


# 02d65815 07-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r202529:
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent possible vnode reference leak.


# b9f180d1 17-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

When a vnode-backed vm object is referenced, it increments the vnode
reference count, and decrements it on dereference. If referenced object
is deallocated, object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent leak.

Add an assertion to the vm_pageout() to make sure that we never get
reference on the vnode but then do not execute code to release it.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 01381811 24-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 26f4eea5 23-Jun-2009 Alan Cox <alc@FreeBSD.org>

The bits set in a page's dirty mask are a subset of the bits set in its
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.

Eliminate an unnecessary #include.


# b78ddb0b 28-May-2009 Alan Cox <alc@FreeBSD.org>

Revise vm_pageout_scan()'s handling of partially dirty pages. Specifically,
rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.

(This change is also a small optimization. It eliminate an unnecessary call
to pmap_is_modified() on pages that are mapped read only.)

Suggested by: tegge


# 47916d0c 17-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate a pointless call to pmap_clear_reference() from vm_pageout_scan().
If the page belongs to an object with a reference count of zero, then it
can't have any managed mappings on which to clear a reference bit.


# 7981aa24 28-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace in a place where it was missed in r191277.

Noted by: pluknet gmail com


# 6bed074c 19-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

In both pageout oom handler and vm_daemon, acquire the reference to
the vmspace of the examined process instead of directly accessing its
vmspace, that may change. Also, as an optimization, check for P_INEXEC
flag before examining the process.

Reported and tested by: pho (previous version)
Reviewed by: alc
MFC after: 3 week


# f4b0c119 19-Apr-2009 Alan Cox <alc@FreeBSD.org>

Calling pmap_clear_modify() after calling pmap_remove_write() is pointless.
The latter function already clears the modified status from each of the
page's mappings.


# 6129343d 16-Nov-2008 Konstantin Belousov <kib@FreeBSD.org>

Instead of forcing vn_start_write() to reset mp back to NULL for the
failed calls with non-NULL vp, explicitely clear mp after failure.

Tested by: stass
Reviewed by: tegge
PR: 123768
MFC after: 1 week


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 2025d69b 29-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Move the code for doing out-of-memory grass from vm_pageout_scan()
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.

Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone
is exhausted.

Reviewed by: alc
Tested by: pho, jhb
MFC after: 2 weeks


# 8d28bf04 21-Sep-2008 Alan Cox <alc@FreeBSD.org>

Prevent an integer overflow in vm_pageout_page_stats() on machines with a
large number of physical pages.

PR: 126158
Submitted by: Dmitry Tejblum
MFC after: 3 days


# 6bd9cb1c 03-Aug-2008 Tom Rhodes <trhodes@FreeBSD.org>

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


# e5b006ff 19-Mar-2008 Alan Cox <alc@FreeBSD.org>

Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent
migration to vm/vm_page.c.


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# da31e3aa 25-Nov-2007 Alan Cox <alc@FreeBSD.org>

Make contigmalloc(9)'s page laundering more robust. Specifically, use
vm_pageout_fallback_object_lock() in vm_contig_launder_page() to better
handle a lock-ordering problem. Consequently, trylock's failure on the
page's containing object no longer implies that the page cannot be
laundered.

MFC after: 6 weeks


# 5dfc2870 22-Nov-2007 Alan Cox <alc@FreeBSD.org>

Add a read/write sysctl for reconfiguring the maximum number of physical
pages that can be wired.

Submitted by: Eugene Grosbein
PR: 114654
MFC after: 6 weeks


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 4cd45723 15-Sep-2007 Alan Cox <alc@FreeBSD.org>

Correct an assertion in vm_pageout_flush(). Specifically, if a page's
status after vm_pager_put_pages() is VM_PAGER_PEND, then it could have
already been recycled, i.e., freed and reallocated to a new purpose;
thus, asserting that such pages cannot be written is inappropriate.

Reported by: kris
Submitted by: tegge
Approved by: re (kensmith)
MFC after: 1 week


# 14137dc0 02-Jul-2007 Alan Cox <alc@FreeBSD.org>

In the previous revision, when I replaced the unconditional acquisition
of Giant in vm_pageout_scan() with VFS_LOCK_GIANT(), I had to eliminate
the acquisition of the vnode interlock before releasing the vm object's
lock because the vnode interlock cannot be held when VFS_LOCK_GIANT() is
performed. Unfortunately, this allows the vnode to be recycled between
the release of the vm object's lock and the vget() on the vnode.

In this revision, I prevent the vnode from being recycled by acquiring
another reference to the vm object and underlying vnode before releasing
the vm object's lock.

This change also addresses another preexisting but trivial problem. By
acquiring another reference to the vm object, I also prevent the vm
object from being recycled. Previously, the "vnodes skipped" counter
could be wrong because if it examined a recycled vm object.

Reported by: kib
Reviewed by: kib
Approved by: re (kensmith)
MFC after: 3 weeks


# 97824da3 26-Jun-2007 Alan Cox <alc@FreeBSD.org>

Eliminate the use of Giant from vm_daemon(). Replace the unconditional
use of Giant in vm_pageout_scan() with VFS_LOCK_GIANT().

Approved by: re (kensmith)
MFC after: 3 weeks


# 9e897b1b 17-Jun-2007 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary checks from vm_pageout_clean(): The page that is
passed to vm_pageout_clean() cannot possibly be PG_UNMANAGED because
it came from the inactive queue and PG_UNMANAGED pages are not in any
page queue. Moreover, PG_UNMANAGED pages only exist in OBJT_PHYS
objects, and all pages within a OBJT_PHYS object are PG_UNMANAGED.
So, if the page that is passed to vm_pageout_clean() is not
PG_UNMANAGED, then it cannot be from an OBJT_PHYS object and its
neighbors from the same object cannot themselves be PG_UNMANAGED.

Reviewed by: tegge


# 2446e4f0 15-Jun-2007 Alan Cox <alc@FreeBSD.org>

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re


# d076fbea 13-Jun-2007 Alan Cox <alc@FreeBSD.org>

Eliminate dead code: We have not performed pageouts on the kernel object
in this millenium.


# 393a081d 10-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Optimize vmmeter locking.
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some unuseful comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


# 982d11f8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# e9f995d8 06-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the
vm page queue free mutex instead of the vm page queue mutex.


# f67af5c9 17-Jan-2007 Xin LI <delphij@FreeBSD.org>

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# 78985e42 01-Aug-2006 Alan Cox <alc@FreeBSD.org>

Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)


# 625e6c0a 17-Feb-2006 Tor Egge <tegge@FreeBSD.org>

Expand scope of marker to reduce the number of page queue scan restarts.


# db27dcc0 17-Feb-2006 Tor Egge <tegge@FreeBSD.org>

Check return value from nonblocking call to vn_start_write().


# 3b7db47d 04-Feb-2006 Alan Cox <alc@FreeBSD.org>

Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.


# 997e1c25 27-Jan-2006 Alan Cox <alc@FreeBSD.org>

Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)


# ef39c05b 31-Dec-2005 Alexander Leidinger <netchild@FreeBSD.org>

MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 7a35a21e 09-Nov-2005 Alan Cox <alc@FreeBSD.org>

Reimplement the reclamation of PV entries. Specifically, perform
reclamation synchronously from get_pv_entry() instead of
asynchronously as part of the page daemon. Additionally, limit the
reclamation to inactive pages unless allocation from the PV entry zone
or reclamation from the inactive queue fails. Previously, reclamation
destroyed mappings to both inactive and active pages. get_pv_entry()
still, however, wakes up the page daemon when reclamation occurs. The
reason being that the page daemon may move some pages from the active
queue to the inactive queue, making some new pages available to future
reclamations.

Print the "reclaiming PV entries" message at most once per minute, but
don't stop printing it after the fifth time. This way, we do not give
the impression that the problem has gone away.

Reviewed by: tegge


# 8dbca793 09-Aug-2005 Tor Egge <tegge@FreeBSD.org>

Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE
due to the vm object being locked.

When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine. If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.

Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation. Detect and handle relevant page queue changes.

This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.

Reviewed by: alc


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# df2e33bf 06-Jan-2005 Alan Cox <alc@FreeBSD.org>

Revise the part of vm_pageout_scan() that moves pages from the cache
queue to the free queue. With this change, if a page from the cache
queue belongs to a locked object, it is simply skipped over rather
than moved to the inactive queue.


# 34d9e6fd 04-Nov-2004 Alan Cox <alc@FreeBSD.org>

During traversal of the inactive queue, try locking the page's containing
object before accessing the page's flags or the object's reference count.


# d19ef814 03-Nov-2004 Alan Cox <alc@FreeBSD.org>

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


# b86e6ec0 30-Oct-2004 Alan Cox <alc@FreeBSD.org>

During traversal of the active queue by vm_pageout_page_stats(), try
locking the page's containing object before accessing the page's flags.


# 4b8a5c40 30-Oct-2004 Alan Cox <alc@FreeBSD.org>

Add an assignment statement that I omitted from the previous revision.


# b08abf6c 27-Oct-2004 Alan Cox <alc@FreeBSD.org>

During traversal of the active queue, try locking the page's containing
object before accessing the page's flags or the object's reference count.
If the trylock fails, handle the page as though it is busy.


# 3e36afbe 17-Jul-2004 Alan Cox <alc@FreeBSD.org>

Remove the GIANT_REQUIRED preceding pmap_remove() in
vm_pageout_map_deactivate_pages().


# 3d2e54c3 15-Jul-2004 Alan Cox <alc@FreeBSD.org>

Push down the acquisition and release of the page queues lock into
pmap_protect() and pmap_remove(). In general, they require the lock in
order to modify a page's pv list or flags. In some cases, however,
pmap_protect() can avoid acquiring the lock.


# 26354d4c 12-Jul-2004 Alan Cox <alc@FreeBSD.org>

Remove an unused and unimplemented sysctl. (For the record, it was marked
as unimplemented in revision 1.129 nearly six years ago.)


# 5e609009 23-Jun-2004 Alan Cox <alc@FreeBSD.org>

Call vm_pageout_page_stats() with the page queues lock held.


# 1aab16a6 23-Jun-2004 Alan Cox <alc@FreeBSD.org>

Remove spl calls.


# fa885116 15-Jun-2004 Julian Elischer <julian@FreeBSD.org>

Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# f651b129 11-May-2004 Alan Cox <alc@FreeBSD.org>

Cache queue pages are not mapped. Thus, the pmap_remove_all() by
vm_pageout_scan()'s loop for freeing cache queue pages is unnecessary.


# dcbcd518 04-Mar-2004 Bruce Evans <bde@FreeBSD.org>

Minor style fixes. In vm_daemon(), don't fetch the rss limit long before
it is needed.


# 9ea8d1a6 21-Feb-2004 Alan Cox <alc@FreeBSD.org>

Eliminate the second, unnecessary call to pmap_page_protect() near the end
of vm_pageout_flush(). Instead, assert that the page is still write
protected.

Discussed with: tegge


# 84d98bf6 14-Feb-2004 Alan Cox <alc@FreeBSD.org>

- Correct a long-standing race condition in vm_page_try_to_cache() that
could result in a panic "vm_page_cache: caching a dirty page, ...":
Access to the page must be restricted or removed before calling
vm_page_cache(). This race condition is identical in nature to that
which was addressed by vm_pageout.c's revision 1.251.
- Simplify the code surrounding the fix to this same race condition
in vm_pageout.c's revision 1.251. There should be no behavioral
change. Reviewed by: tegge

MFC after: 7 days


# a3dfacb5 10-Feb-2004 Alan Cox <alc@FreeBSD.org>

Correct a long-standing race condition in the inactive queue scan. (See
the added comment for low-level details.) The effect of this race
condition is a panic "vm_page_cache: caching a dirty page, ..."

Reviewed by: tegge
MFC after: 7 days


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 2e3b314d 24-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing
vm_pageout_page_stats() from Giant.
- Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the
vm object to be locked on entry. (All of the pager routines now expect
this.)


# ab42316c 22-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from
vm_pageout_scan(). Rationale: I don't like leaving a busy page in the
cache queue with neither the vm object nor the vm page queues lock held.
- Assert that the page is active in vm_pageout_page_stats().


# d3c09dd7 21-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Assert that every page found in the active queue is an active page.


# 7a935082 18-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Increase the object lock's scope in vm_contig_launder() so that access
to the object's type field and the call to vm_pageout_flush() are
synchronized.
- The above change allows for the eliminaton of the last parameter
to vm_pageout_flush().
- Synchronize access to the page's valid field in vm_pageout_flush()
using the containing object's lock.


# 6989c456 16-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to a vm page's valid field using the containing
vm object's lock.
- Release the vm object and vm page queues locks around vput().


# 45ae1d91 18-Sep-2003 Alan Cox <alc@FreeBSD.org>

Merge vm_pageout_free_page_calc() into vm_pageout(), eliminating some
unneeded code.


# 4b5f5531 17-Sep-2003 Alan Cox <alc@FreeBSD.org>

When calling vget() on a vnode-backed vm object, acquire the vnode
interlock before releasing the vm object's lock.


# 3562af12 30-Aug-2003 Alan Cox <alc@FreeBSD.org>

- Add vm object locking to the part of vm_pageout_scan() that launders
dirty pages.
- Remove some unused variables.


# 3e1b578a 14-Aug-2003 Alan Cox <alc@FreeBSD.org>

Extend the scope of the page queues lock in vm_pageout_scan() to cover
the traversal of the PQ_INACTIVE queue.


# 8f60c087 03-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Change the layout policy of the swap_pager from a hardcoded width
striping to a per device round-robin algorithm.

Because of the policy of not attempting to retain previous swap
allocation on page-out, this means that a newly added swap device
almost instantly takes its 1/N share of the I/O load but it takes
somewhat longer for it to assume it's 1/N share of the pages if there
is plenty of space on the other devices.

Change the 8G total swapspace limitation to 8G per device instead
by using a per device blist rather than one global blist. This
reduces the memory footprint by 75% (typically a couple hundred
kilobytes) for the common case with one swapdevice but NSWAPDEV=4.

Remove the compile time constant limit of number of swap devices,
there is no limit now. Instead of a fixed size array, store the
per swapdev structure in a TAILQ.

Total swap space is still addressed by a 32 bit page number and
therefore the upper limit is now 2^42 bytes = 16TB (for i386).

We still do not allocate the first page of each device in order to
give some amount of protection to any bsdlabel at the start of the
device.

A new device is appended after the existing devices in the swap space,
no attempt is made to fill in holes left behind by swapoff (this can
trivially be changed should it ever become a problem).

The sysctl vm.nswapdev now reflects the number of currently configured
swap devices.

Rename vm_swap_size to swap_pager_avail for consistency with other
exported names.

Change argument type for vm_proc_swapin_all() and swap_pager_isswapped()
to be a struct swdevt pointer rather than an index.

Not changed: we are still using blists to manage the free space,
but since the swapspace is no longer fragmented by the striping
different resource managers might fare better.


# ecf6279f 07-Jul-2003 Alan Cox <alc@FreeBSD.org>

- Complete the vm object locking in vm_pageout_object_deactivate_pages().
- Change vm_pageout_object_deactivate_pages()'s first parameter from a
vm_map_t to a pmap_t.
- Change vm_pageout_object_deactivate_pages()'s and
vm_pageout_map_deactivate_pages()'s last parameter from a vm_pindex_t
to a long. Since the number of pages in an address space doesn't
require 64 bits on an i386, vm_pindex_t is overkill.


# 0774dfb3 29-Jun-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to vm_pageout_map_deactivate_pages().


# 5163584c 28-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Add vm object locking to vm_pageout_clean().


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# e92686d0 18-May-2003 David Schultz <das@FreeBSD.org>

If we seem to be out of VM, don't allow the pagedaemon to kill
processes in the first pass. Among other things, this will give
us a chance to launder vnode-backed pages before concluding that
we need more swap. This is particularly useful for systems that
have no swap.

While here, update a comment and remove some long-unused code.

Reported by: Lucky Green <shamrock@cypherpunks.to>
Suggested by: dillon
Approved by: re (rwatson)


# c4a1d732 04-May-2003 Alan Cox <alc@FreeBSD.org>

Avoid a lock-order reversal and implement vm_object locking
in vm_pageout_page_free().


# 85b1dc89 29-Apr-2003 Alan Cox <alc@FreeBSD.org>

Eliminate an unused parameter from vm_pageout_object_deactivate_pages().


# b6e48e03 23-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


# 897ecacd 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Lock the proc to check p_flag and several other related tests in
vm_daemon(). We don't need to hold sched_lock as long now as a result.


# 72ba747d 20-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_wakeup().
- Merge two identical cases in a switch statement.


# d22bc710 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_add().


# f4cf2141 31-Mar-2003 Wes Peters <wes@FreeBSD.org>

Add a facility allowing processes to inform the VM subsystem they are
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.

Reviewed by: arch@
Sponsored by: St. Bernard Software


# 72d97679 12-Mar-2003 David Schultz <das@FreeBSD.org>

- When the VM daemon is out of swap space and looking for a
process to kill, don't block on a map lock while holding the
process lock. Instead, skip processes whose map locks are held
and find something else to kill.
- Add vm_map_trylock_read() to support the above.

Reviewed by: alc, mike (mentor)


# 6b4b77ad 09-Feb-2003 Alan Cox <alc@FreeBSD.org>

Add a comment describing how pagedaemon_wakeup() should be used and
synchronized.

Suggested by: tegge


# a1c0a785 02-Feb-2003 Alan Cox <alc@FreeBSD.org>

- It's more accurate to say that vm_paging_needed() returns TRUE
than a positive number.
- In pagedaemon_wakeup(), set vm_pages_needed to 1 rather than
incrementing it to accomplish the same.


# 8e1d8de5 01-Feb-2003 Alan Cox <alc@FreeBSD.org>

- Convert vm_pageout()'s tsleep()s to msleep()s with the page queue lock.


# 8b245767 01-Feb-2003 Alan Cox <alc@FreeBSD.org>

- Remove (some) unnecessary explicit initializations to zero.
- Style changes to vm_pageout(): declarations and white-space.


# b0ef8c5f 13-Jan-2003 Alan Cox <alc@FreeBSD.org>

- Update vm_pageout_deficit using atomic operations. It's a simple
counter outside the scope of existing locks.
- Eliminate a redundant clearing of vm_pageout_deficit.


# ff2023a5 13-Jan-2003 Alan Cox <alc@FreeBSD.org>

Make vm_pageout_page_free() static.


# c410df59 03-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Avoid extern decls in .c files by putting them in the vm/swap_pager.h
include file where they belong.
Share the dmmax_mask variable.


# 40bb4f4b 28-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

vm_pager_put_pages() takes VM_PAGER_* flags, not OBJPC_* flags. It just
so happens that OBJPC_SYNC has the same value as VM_PAGER_PUT_SYNC so no
harm done. But fix it :-)

No operational changes.

MFC after: 1 day


# b365ea9e 17-Dec-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_flag_set().


# 38857e7f 30-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when calling pmap_protect(); it updates fields
of the vm_page structure. Nearby, remove an unnecessary semicolon and
return statement.

Approved by: re (blanket)


# 78f7187d 30-Nov-2002 Alan Cox <alc@FreeBSD.org>

Increase the scope of the page queue lock in vm_pageout_scan().

Approved by: re (blanket)


# ba0208b9 23-Nov-2002 Alan Cox <alc@FreeBSD.org>

Assert that the page queues lock rather than Giant is held in
vm_pageout_page_free().

Approved by: re


# 855a310f 21-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add an event that is triggered when the system is low on memory. This is
intended to be used by significant memory consumers so that they may drain
some of their caches.

Inspired by: phk
Approved by: re
Tested on: x86, alpha


# a12cc0e4 17-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# eea85e9b 12-Nov-2002 Alan Cox <alc@FreeBSD.org>

Move pmap_collect() out of the machine-dependent code, rename it
to reflect its new location, and add page queue and flag locking.

Notes: (1) alpha, i386, and ia64 had identical implementations
of pmap_collect() in terms of machine-independent interfaces;
(2) sparc64 doesn't require it; (3) powerpc had it as a TODO.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# b43179fb 11-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 3ef3e7c4 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the unused LK_NOOBJ.


# 05ba50f5 21-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 67ef391e 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_activate().


# 299018d3 27-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_free().
o Increment cnt.v_dfree inside vm_pageout_page_free() rather than
at each call.


# 55df3298 27-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Require that the page queues lock is held on entry to vm_pageout_clean()
and vm_pageout_flush().
o Acquire the page queues lock before calling vm_pageout_clean()
or vm_pageout_flush().


# ce18aebd 27-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_activate() and vm_page_deactivate()
in vm_pageout_object_deactivate_pages().
o Apply some style fixes to vm_pageout_object_deactivate_pages().


# 8ffc1519 22-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Extend the scope of the page queues lock in vm_pageout_scan()
to cover the traversal of the cache queue.


# 40eab1e9 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_try_to_cache(). (The accesses
in kern/vfs_bio.c are already locked.)
o Assert that the page queues lock is held in vm_page_try_to_cache().


# 15a5d210 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_cache() in vm_fault() and
vm_pageout_scan(). (The others are already locked.)
o Assert that the page queues lock is held in vm_page_cache().


# 48c0444c 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock accesses to the active page queue in vm_pageout_scan() and
vm_pageout_page_stats().


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# d974f03c 28-Apr-2002 Alan Cox <alc@FreeBSD.org>

o Introduce and use vm_map_trylock() to replace several direct uses
of lockmgr().
o Add missing synchronization to vmspace_swap_count(): Obtain a read lock
on the vm_map before traversing it.


# 670d17b5 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.


# 11caded3 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 25adb370 18-Mar-2002 Brian Feldman <green@FreeBSD.org>

Back out the modification of vm_map locks from lockmgr to sx locks. The
best path forward now is likely to change the lockmgr locks to simple
sleep mutexes, then see if any extra contention it generates is greater
than removed overhead of managing local locking state information,
cost of extra calls into lockmgr, etc.

Additionally, making the vm_map lock a mutex and respecting it properly
will put us much closer to not needing Giant magic in vm.


# 0e0af8ec 13-Mar-2002 Brian Feldman <green@FreeBSD.org>

Rename SI_SUB_MUTEX to SI_SUB_MTX_POOL to make the name at all accurate.
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.

After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.

This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# 7f3a4093 27-Feb-2002 Mike Silbersack <silby@FreeBSD.org>

Fix a horribly suboptimal algorithm in the vm_daemon.

In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.

Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.

Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.

Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.

PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week


# ef6020d1 19-Feb-2002 Mike Silbersack <silby@FreeBSD.org>

Changes to make the OOM killer much more effective:

- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.

Reviewed by: dillon
MFC after: 1 week


# 027df6bd 31-Jan-2002 Matthew Dillon <dillon@FreeBSD.org>

GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after: 0 days


# 23b59018 20-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# 57601bcb 21-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Syntax cleanup and documentation, no operational changes.

MFC after: 1 day


# 30105b9e 14-Oct-2001 Tor Egge <tegge@FreeBSD.org>

Don't remove all mappings of a swapped out process if the vm map contained
wired entries. vm_fault_unwire() depends on the mapping being intact.

Reviewed by: dillon


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 6d03d577 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and aquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# ad6c5bbe 20-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Don't lock around swap_pager_swap_init() that is only called once during
the pagedaemon's startup code since it calls malloc which results in lock
order reversals.


# 69a78d46 19-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Put the scheduler, vmdaemon, and pagedaemon kthreads back under Giant for
now. The proc locking isn't actually safe yet and won't be until the proc
locking is finished.


# ff2b5645 09-Jun-2001 Matthew Dillon <dillon@FreeBSD.org>

Two fixes to the out-of-swap process termination code. First, start killing
processes a little earlier to avoid a deadlock. Second, when calculating
the 'largest process' do not just count RSS. Instead count the RSS + SWAP
used by the process. Without this the code tended to kill small
inconsequential processes like, oh, sshd, rather then one of the many
'eatmem 200MB' I run on a whim :-). This fix has been extensively tested on
-stable and somewhat tested on -current and will be MFCd in a few days.

Shamed into fixing this by: ps


# 3614c6fc 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Add in several asserts of vm_mtx.
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also,
lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon. When process
limits are locked we will be only need Giant for swapout_procs().


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 1c58e4e5 17-May-2001 John Baldwin <jhb@FreeBSD.org>

During the code to pick a process to kill when memory is exhausted, keep
the process in question locked as soon as we find it and determine it to
be eligible until we actually kill it. To avoid deadlock, we don't block
on the process lock but skip any process that is already locked during our
search.


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 1005a129 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 8606d880 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Catch up to proc flag changes.
- Minimal proc locking.
- Use queue macros.


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# 21cd6e62 13-Dec-2000 Seigo Tanimura <tanimura@FreeBSD.org>

- If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.

This is just a temporary fix. Better solutions include:
(suggested by: dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by: alfred, dillon


# c0c25570 12-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 02fa91d3 11-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

Be less conservative with a recently added KASSERT. Certain edge
cases with file fragments and read-write mmap's can lead to a situation
where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by
an mmap and only the fragment (representing a non-page-aligned end of
file) synced via a filesystem buffer. A correct solution that
guarentees consistent m->dirty for the file EOF case is being
worked on. In the mean time we can't be so conservative in the
KASSERT.


# c5a44a6a 01-Dec-2000 John Baldwin <jhb@FreeBSD.org>

Protect p_stat with sched_lock.


# 553629eb 22-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 8b03c8ed 29-May-2000 Matthew Dillon <dillon@FreeBSD.org>

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# 883f3caa 28-May-2000 Matthew Dillon <dillon@FreeBSD.org>

Fix bug in vm_pageout_page_stats() that always resulted in a full
scan of the active queue. This fix is not expected to have any
noticeable impact on performance.

Noticed by: Rik van Riel <riel@conectiva.com.br>


# 24964514 21-May-2000 Peter Wemm <peter@FreeBSD.org>

Checkpoint of a new physical memory backed object type, that does not
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.

This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.

Reviewed by: dillon


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 25db2c54 27-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Add necessary spl protection for swapper. The problem was located by
Alfred while testing his SPLASSERT stuff. This is not a complete fix,
more protections are probably needed.


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# 6bdfe06a 11-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# be72f788 30-Oct-1999 Alan Cox <alc@FreeBSD.org>

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 90ecac61 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>

Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.

Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().

add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# aeea9b36 21-Aug-1999 Alan Cox <alc@FreeBSD.org>

Remove two unused variable declarations.


# bfbacbd9 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_pageout_clean:
Remove dead code.

Submitted by: dillon


# e929c00d 03-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 9c8b8baa 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# d50c1994 26-Jun-1999 Peter Wemm <peter@FreeBSD.org>

There isn't much point waking up a daemon that hasn't existed since
softupdates came in. Try calling speedup_syncer() instead..


# 11a9f83f 23-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Make pmap_collect() an official pmap interface.


# c8da68e9 05-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Don't forcibly kill processes that are locked in-core via PHOLD - it was
just checking P_NOSWAP before.


# 51df5949 11-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Stop the mfs from trying to swap out crucial bits of the mfs
as this can lead to deadlock.
Submitted by: Mat dillon <dillon@freebsd.org>


# fe2144fd 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Eliminate a possible numerical overflow.


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# faa273d5 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other then the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guarenteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.


# 9fdfe602 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 7dbf82dc 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Change all manual settings of vm_page_t->dirty = VM_PAGE_BITS_ALL
to use the vm_page_dirty() inline.

The inline can thus do sanity checks ( or not ) over all cases.


# d044d7bf 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Added warning printf ( needs INVARIANTS ) when busy cache page is found
while trying to free memory.


# aaba53da 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

It is possible for a page in the cache to be busy. vm_pageout.c was not
checking for this condition while it tried to free cache pages. Fixed.


# a489f761 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Reorganized some of the low memory testing code to make it more useful.

Removed call to vm_object_collapse(), which can block. This was being
called without the pageout code holding any sort of reference on the
vm_object or vm_page_t structures being manipulated. Since this code
can block, it was possible for other kernel code to shred the state
the pageout code was assuming remained intact.

Fixed potential blocking condition in vm_pageout_page_free() ( which
could cause a deadlock in a low-memory situation ).

Currently there is a hack in-place to deal with clean filesystem meta-data
polluting the inactive page queue. John doesn't like the hack, and neither
do I.

Revamped and commented a portion of the pageout loop.

Added protection against potential memory deadlocks with OBJT_VNODE
when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL
which causes VOP_ISLOCKED() to return a less informed answer.

remove vm_pager_sync() -- none of the pagers use it any more ( the old
swapper used to. The new one does not ).


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# b0359e2c 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Add John Dyson's SYSCTL descriptions, and an export of more stats to
a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present
in source, they do not get compiled into the binaries taking up memory.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# faa5f8d8 29-Sep-1998 Andrzej Bialecki <abial@FreeBSD.org>

Make #define NO_SWAPPING a normal kernel config option.

Reviewed by: jkh


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# d474eaaa 06-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)


# 427e99a0 10-Jul-1998 Alexander Langer <alex@FreeBSD.org>

Removed unnecessary test from if/else construct.

PR: 7233
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>


# e8f36785 01-Jun-1998 John Dyson <dyson@FreeBSD.org>

Correct sleep priority.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# be01eafd 08-Mar-1998 John Dyson <dyson@FreeBSD.org>

Quell unneeded pageout daemon activity.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# a15403de 24-Feb-1998 John Dyson <dyson@FreeBSD.org>

Correct some severe VM tuning problems for small systems (<=16MB), and
improve tuning on larger systems. (A couple of the VM tuning params for
small systems were so badly chosen that the system could hang under load.)

The broken tuning was originaly my fault.


# e47ed70b 23-Feb-1998 John Dyson <dyson@FreeBSD.org>

Significantly improve the efficiency of the swap pager, which appears to
have declined due to code-rot over time. The swap pager rundown code
has been clean-up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overheadla.)


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# b44e4b7a 24-Dec-1997 John Dyson <dyson@FreeBSD.org>

Support running with inadequate swap space. Additionally, the code
will complain with a suggestion of increasing it.


# ceb0cf87 05-Dec-1997 John Dyson <dyson@FreeBSD.org>

Support an optional, sysctl enabled feature of idle process swapout. This
is apparently useful for large shell systems, or systems with long running
idle processes. To enable the feature:

sysctl -w vm.swap_idle_enabled=1

Please note that some of the other vm sysctl variables have been renamed
to be more accurate.
Submitted by: Much of it from Matt Dillon <dillon@best.net>


# 70111b90 04-Dec-1997 John Dyson <dyson@FreeBSD.org>

Add new (very useful) tunable for pageout daemon. The flag changes
the maximum pageout rate:

sysctl -w vm.vm_maxlaunder=n

1 < n < inf.

If paging heavily on large systems, it is likely that a performance
improvement can be achieved by increasing the parameter. On a large
system, the parm is 32, but numbers as large as 128 can make a big
difference. If paging is expensive, you might try decreasing the
number to 1-8.


# 12ac6a1d 04-Dec-1997 John Dyson <dyson@FreeBSD.org>

Support applications that need to resist or deny use of swap space.

sysctl -w vm.defer_swap_pageouts=1
Causes the system to resist the use of swap space. In low memory
conditions, performance will decrease.
sysctl -w vm.disable_swap_pageouts=1
Causes the system to mostly disable the use of swap space. In
low memory conditions, the system will likely start killing
processes.


# 5985940e 24-Oct-1997 John Dyson <dyson@FreeBSD.org>

Support garbage collecting the pmap pv entries. The management doesn't
happen until the system would have nearly failed anyway, so no signficant
overhead is added. This helps large systems with lots of processes.


# 7e006499 05-Oct-1997 John Dyson <dyson@FreeBSD.org>

Improve management of pages moving from the inactive to active queue. Additionally,
add some much needed comments.


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# dc2efb27 26-Jul-1997 John Dyson <dyson@FreeBSD.org>

Add the ability for the pageout daemon to measure stats on memory usage before
the system is out of memory. The daemon does a minimal amount of work that
increases as the system becomes more likely to run out of memory and page in/out.

The default tuning is fairly low in background CPU usage, and sysctl variables
have been added to enable flexable operation. This is an experimental feature
that will likely be changed and improved over time.


# 2f558c3e 27-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Removed a wrong LK_INTERLOCK flag.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# afa07f7e 15-Jan-1997 John Dyson <dyson@FreeBSD.org>

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.


# 16a02c11 15-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Removed redundant spl0()'s from kernel processes. They were work-arounds
for a bug in fork().


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# d4a272db 11-Jan-1997 John Dyson <dyson@FreeBSD.org>

Slightly correct the code that moves pages from the active to the
inactive queue. This is only a minor performance improvement, but will
not affect perf on machines that don't have ref bits.


# 106031ef 03-Jan-1997 John Dyson <dyson@FreeBSD.org>

Undo the collapse breakage (swap space usage problem.)


# 3c018e72 31-Dec-1996 John Dyson <dyson@FreeBSD.org>

Guess what? We left alot of the old collapse code that is not needed
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.


# e0c5a895 28-Nov-1996 John Dyson <dyson@FreeBSD.org>

Make the kernel smaller with at worst a neutral effect on perf by
de-inlining some VM calls. (Actually, I measured a small improvement.)


# a2f4a846 27-Sep-1996 John Dyson <dyson@FreeBSD.org>

Reviewed by:
Submitted by:
Obtained from:


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 502ba6e4 07-Jul-1996 John Dyson <dyson@FreeBSD.org>

Back-off on the previous commit, specifically remove the look-ahead
optimization on the active queue scan. I will do this correctly later.


# c8c4b40c 07-Jul-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem with the pageout daemon RSS limiting, where it degrades
performance to LRU or worse when RSS limiting takes effect. Also,
make an end condition in the active queue scan more efficient in the
case where pages are removed from the active queue as a side effect
of a pmap operation.


# 01155bd7 29-Jun-1996 David Greenman <dg@FreeBSD.org>

Make sure we have an object in the map entry before trying to trim pages
from it.


# 38efa82b 25-Jun-1996 John Dyson <dyson@FreeBSD.org>

This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.

Added some sysctls that help with VM system tuning.

New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.

vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.

Please give me feedback on the performance results
associated with these.

2) Enable/disable swapping.
vm.swapping_enabled = 1, default.

vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.

The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".

Each of these can be changed "on the fly."


# a001376d 23-Jun-1996 John Dyson <dyson@FreeBSD.org>

Remove RSS limiting until I rewrite the code to be non-recursive. The
code can overrun the kernel stack under very stressful conditions.


# ef743ce6 16-Jun-1996 John Dyson <dyson@FreeBSD.org>

Several bugfixes/improvements:
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup
2) Create a new entry point into pmap: pmap_ts_referenced, eliminates
the need to scan the pv lists twice in many cases. Perhaps there
is alot more to do here to work on minimizing pv list manipulation
3) Minor improvements to vm_pageout including the use of pmap_ts_ref.
4) Major changes and code improvement to pmap. This code has had
several serious bugs in page table page manipulation. In order
to simplify the problem, and hopefully solve it for once and all,
page table pages are no longer "managed" with the pv list stuff.
Page table pages are only (mapped and held/wired) or
(free and unused) now. Page table pages are never inactive,
active or cached. These changes have probably fixed the
hold count problems, but if they haven't, then the code is
simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I
have taken a first (of probably many) steps. Please tell me
if you have any ideas.


# f35329ac 30-May-1996 John Dyson <dyson@FreeBSD.org>

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# 545901f7 29-May-1996 John Dyson <dyson@FreeBSD.org>

Correct some unfortunately chosen constants, otherwise, not enough
pages are calculated for deferred allocation of swap pager data structures.
This is a follow-on to the previous commit to this file.


# b182ec9e 28-May-1996 John Dyson <dyson@FreeBSD.org>

After careful review by David Greenman and myself, David had found a
case where blocking can occur, thereby giving other process's a chance
to modify the queue where a page resides. This could cause numerous
process and system failures.


# 85a376eb 26-May-1996 John Dyson <dyson@FreeBSD.org>

Fix a couple of problems in the pageout_scan routine. First, there is
a condition when blocking can occur, and the daemon did not check properly
for a page remaining on the expected queue. Additionally, the inactive
target was being set much too large for small memory machines. It is now
being calculated based upon the amount of user memory available on every
pageout daemon run. Another problem was that if memory was very low, the
pageout daemon could fail repeatedly to traverse the inactive queue.


# 1eeaa1e3 23-May-1996 John Dyson <dyson@FreeBSD.org>

Add apparently needed splvm protection to the active queue, and eliminate
an unnecessary test for dirty pages if it is already known to be dirty.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# bd105bb7 11-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Fixed a spl hog. The vmdaemon process ran entirely at splhigh. It
sometimes disabled clock interrupts for 60 msec or more on a P133.
Clock interrupts were lost ...

Reviewed by: dyson


# 30dcfc09 27-Mar-1996 John Dyson <dyson@FreeBSD.org>

VM performance improvements, and reorder some operations in VM fault
in anticipation of a fix in pmap that will allow the mlock system call to work
without panicing the system.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 1b67ec6d 10-Mar-1996 Jeffrey Hsu <hsu@FreeBSD.org>

For Lite2: proc LIST changes.
Reviewed by: davidg & bde


# 6ac5bfdb 08-Mar-1996 John Dyson <dyson@FreeBSD.org>

Fix a calculation for a paging parameter.


# 5afce282 22-Feb-1996 David Greenman <dg@FreeBSD.org>

Add a "NO_SWAPPING" option to disable swapping. This was originally done
to help diagnose a problem on wcarchive (where the kernel stack was
sometimes not present), but is useful in its own right since swapping
actually reduces performance on some systems (such as wcarchive).
Note: swapping in this context means making the U pages pageable and has
nothing to do with generic VM paging, which is unaffected by this option.

Reviewed by: <dyson>


# 729b1e51 30-Jan-1996 David Greenman <dg@FreeBSD.org>

Improved killproc() log message and made it and the other similar message
tolerant of p_ucred being invalid. Starting using killproc() where
appropriate.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# aef922f5 05-Nov-1995 John Dyson <dyson@FreeBSD.org>

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


# a91c5a7e 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

Get rid of machine-dependent NBPG and replace with PAGE_SIZE.


# cd41fc12 07-Oct-1995 David Greenman <dg@FreeBSD.org>

Fix argument passing to the "freeer" routine. Added some prototypes. (bde)
Moved extern declaration of swap_pager_full into swap_pager.h and out of
the various files that reference it. (davidg)

Submitted by: bde & davidg


# a5eb0e27 06-Oct-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Avoid a 64bit divide.


# 4590fd3a 09-Sep-1995 David Greenman <dg@FreeBSD.org>

Fixed init functions argument type - caddr_t -> void *. Fixed a couple of
compiler warnings.


# 2b14f991 28-Aug-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 6306c897 10-Jul-1995 David Greenman <dg@FreeBSD.org>

swapout_threads() -> swapout_procs().


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 61f5d510 21-May-1995 David Greenman <dg@FreeBSD.org>

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# ee3a64c9 10-May-1995 David Greenman <dg@FreeBSD.org>

Changed "handle" from type caddr_t to void *; "handle" is several different
types of pointers, and "char *" is a bad choice for the type.


# 4c1f8ee9 17-Apr-1995 David Greenman <dg@FreeBSD.org>

Fixed a logic bug that caused the vmdaemon to not wake up when intended.

Submitted by: John Dyson


# 7c0414d0 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Removed obsolete/unused variable declarations. Killed externs and included
appropriate include files.


# c3cb3e12 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 17c4c408 27-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed typo...using wrong variable in page_shortage calculation.


# 0bb3a0d2 27-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed "pages freed by daemon" statistic (again).


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 61ca29b0 12-Mar-1995 David Greenman <dg@FreeBSD.org>

Deleted vm_object_setpager().


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 4f9fb771 25-Feb-1995 Bruce Evans <bde@FreeBSD.org>

Don't use __P(()) in a function definition.


# 6f2b142e 22-Feb-1995 David Greenman <dg@FreeBSD.org>

vm_page.c:
Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding
if the caller is "important" in vm_page_alloc(). Also established a new
low threshold for non-interrupt allocations via cnt.v_interrupt_free_min.

vm_pageout.c:
Various algorithmic cleanup. Some calculations simplified. Initialize
cnt.v_interrupt_free_min to 2 pages.

Submitted by: John Dyson


# c0503609 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 0c1dacbc 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Moved ACT_MAX, ACT_ADVANCE, and ACT_DECLINE to vm_page.h.


# d2fc5315 13-Feb-1995 Poul-Henning Kamp <phk@FreeBSD.org>

YF fix.


# 1ed81ef2 09-Feb-1995 David Greenman <dg@FreeBSD.org>

Minor algorithmic adjustments that reduce the CPU consumption of the
pagedaemon in half while not reducing its effectiveness.

Submitted by: me & John


# a1f6d91c 02-Feb-1995 David Greenman <dg@FreeBSD.org>

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded it's functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


# 8f895206 27-Jan-1995 David Greenman <dg@FreeBSD.org>

Completed the fix for attempting to page out pages via the device_pager.

Submitted by: John Dyson


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to acheive better paging performance and to
accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# 480dff54 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed some formatting weirdness that I overlooked in the previous commit.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 832f3afd 02-Jan-1995 Andreas Schulz <ats@FreeBSD.org>

Submitted by: Ben Jackson
just a missing newline in a kernel printf added.


# eadf9e27 25-Nov-1994 David Greenman <dg@FreeBSD.org>

These changes fix a couple of lingering VM problems:

1. The pageout daemon used to block under certain
circumstances, and we needed to add new functionality
that would cause the pageout daemon to block more often.
Now, the pageout daemon mostly just gets rid of pages
and kills processes when the system is out of swap.
The swapping, rss limiting and object cache trimming
have been folded into a new daemon called "vmdaemon".
This new daemon does things that need to be done for
the VM system, but can block. For example, if the
vmdaemon blocks for memory, the pageout daemon
can take care of it. If the pageout daemon had
blocked for memory, it was difficult to handle
the situation correctly (and in some cases, was
impossible).

2. The collapse problem has now been entirely fixed.
It now appears to be impossible to accumulate unnecessary
vm objects. The object collapsing now occurs when ref counts
drop to one (where it is more likely to be more simple anyway
because less pages would be out on disk.) The original
fixes were incomplete in that pathological circumstances
could still be contrived to cause uncontrolled growth
of swap. Also, the old code still, under steady state
conditions, used more swap space than necessary. When
using the new code, users will generally notice a
significant decrease in swap space usage, and theoretically,
the system should be leaving fewer unused pages around
competing for memory.

Submitted by: John Dyson


# 79221631 16-Nov-1994 David Greenman <dg@FreeBSD.org>

Don't ever try to kill off process 1 - even if we are out of swap space
and it's the candidate pig.


# b0150bfc 13-Nov-1994 David Greenman <dg@FreeBSD.org>

Set laundry flag when transitioning an inactive page from clean to dirty.
This fixes a performance bug where pages would sometimes not be paged
out when they could be.

Submitted by: John Dyson


# 2fe6e4d7 05-Nov-1994 David Greenman <dg@FreeBSD.org>

Added support for starting the experimental "vmdaemon" system process.
Enabled via REL2_1.

Added support for doing object collapses "on the fly". Enabled via REL2_1a.

Improved object collapses so that they can happen in more cases. Improved
sensing of modified pages to fix an apparant race condition and improve
clustered pageout opportunities. Fixed an "oops" with not restarting page
scan after a potential block in vm_pageout_clean() (not doing this can result
in strange behavior in some cases).

Submitted by: John Dyson & David Greenman


# 191ee5b3 24-Oct-1994 David Greenman <dg@FreeBSD.org>

#if 0'd out the object cache trimming code - there are multiple ways
that the pageout daemon can deadlock otherwise.

Submitted by: John Dyson


# e8fbe458 23-Oct-1994 David Greenman <dg@FreeBSD.org>

Fixed object cache trimming policy so it actually works.

Submitted by: John Dyson


# 389918ee 23-Oct-1994 David Greenman <dg@FreeBSD.org>

Adjusted reserved levels to fix a deadlock condition.

Submitted by: John Dyson


# 5663e6de 21-Oct-1994 David Greenman <dg@FreeBSD.org>

Various changes to allow operation without any swapspace configured. Note
that this is intended for use only in floppy situations and is done at
the sacrifice of performance in that case (in ther words, this is not the
best solution, but works okay for this exceptional situation).

Submitted by: John Dyson


# a58d1fa1 18-Oct-1994 David Greenman <dg@FreeBSD.org>

Fix the remaining vmmeter counters. They all now work correctly.


# 976e77fc 15-Oct-1994 David Greenman <dg@FreeBSD.org>

1) Some of the counters in the vmmeter struct don't fit well into the Mach VM
scheme of things, so I've changed them to be more appropriate. page in/ous
are now associated with the pager that did them. Nuked v_fault as the
only fault of interest that wouldn't be already counted in v_trap is a VM
fault, and this is counted seperately.
2) Implemented most of the remaining counters and corrected the counting of
some that were done wrong. They are all almost correct now...just a few
minor ones left to fix.


# 28e12d63 13-Oct-1994 David Greenman <dg@FreeBSD.org>

Fixed an object reference count problem that was caused by a call to
vm_object_lookup() being outside of some parens. The bug was introduced
via some recently added code.

Reviewed by: John Dyson


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 4e39a515 07-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. Unused vars and other warnings.


# 5cedf680 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Fixed minor bug caused by some missing parens that can result in slightly
reduced paging performance by missing a clustering opportunity. Found
by Poul-Henning Kamp with gcc -Wall.


# edaaafdb 03-Oct-1994 David Greenman <dg@FreeBSD.org>

Fixed bug related to proper sensing of page modification that we
inadvertantly introduced in pre-1.1.5. This could cause page modifications
to go unnoticed during certain extreme low memory/high paging rate conditions.

Submitted by: John Dyson and David Greenman


# 11b224dc 12-Sep-1994 David Greenman <dg@FreeBSD.org>

Fixed a bug I introduced when fixing the rss limit code. Changed swapout
policy to be a bit more selective about what processes get swapped out.

Reviewed by: John Dyson


# ed74321b 12-Sep-1994 David Greenman <dg@FreeBSD.org>

Don't deactivate pages in 0-refcount objects. Added a couple of missing
paging stats. Fixed problem with free_reserved becoming depleted during
certain swap_pager operations.

Submitted by: John Dyson, with a little help from me


# a647a309 06-Sep-1994 David Greenman <dg@FreeBSD.org>

Simple changes to paging algorithms...but boy do they make a difference.
FreeBSD's paging performance has never been better. Wow.

Submitted by: John Dyson


# a6ca859e 30-Aug-1994 David Greenman <dg@FreeBSD.org>

Fixed bug caused by change of rlimit variables to quad_t's. The bug was in
using min() to calculate the minimum of rss_cur,rss_max - since these
are now quad_t's and min() takes u_ints...the comparison later for exceeding
the rss limit was always true - resulting in rather serious page thrashing.
Now using new qmin() function for this purpose.

Fixed another bug where PG_BUSY pages would sometimes be paged out (bad!).
This was caused by the PG_BUSY flag not being included in a comparison.


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 16f62314 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/
pmap_kremove. These routine allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


# bbc0ec52 03-Aug-1994 David Greenman <dg@FreeBSD.org>

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 03e6c253 01-Aug-1994 David Greenman <dg@FreeBSD.org>

Removed all code related to the pagescan daemon, and changed 'act_count'
adjustments to compensate for a world without the pagescan daemon.


# 0a620217 06-Jun-1994 David Greenman <dg@FreeBSD.org>

Don't move the page's position in the active queue if it is busy or
held. John has noticed some stability problems when doing this.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources