History log of /freebsd-current/sys/vm/swap_pager.c
Revision Date Author Comments
# 6ada4e8a 08-May-2024 Konstantin Belousov <kib@FreeBSD.org>

swap-like pagers: assert that writemapping decrease does not pass zero

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D45119


# 46966507 08-Apr-2024 Mark Johnston <markj@FreeBSD.org>

swap_pager: Unbusy readahead pages after an I/O error

The swap pager itself allocates readahead pages, so should take care to
unbusy them after a read error, just as it does in the non-error case.

PR: 277538
Reviewed by: olce, dougm, alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44646


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# e61568ae 02-Oct-2023 Mark Johnston <markj@FreeBSD.org>

swap_pager: Fix a race in swap_pager_swapoff_object()

When we disable swapping to a device, we scan the full VM object list
looking for objects with swap trie nodes that reference the device in
question. The pages corresponding to those nodes are paged in.

While paging in, we drop the VM object lock. Moreover, we do not hold a
reference for the object; swap_pager_swapoff_object() merely bumps the
paging-in-progress counter. vm_object_terminate() waits for this
counter to drain before proceeding and freeing pages.

However, swap_pager_swapoff_object() decrements the counter before
re-acquiring the VM object lock, which means that vm_object_terminate()
can race to acquire the lock and free the pages. Then,
swap_pager_swapoff_object() ends up unbusying a freed page. Fix the
problem by acquiring the lock before waking up sleepers.

PR: 273610
Reported by: Graham Perrin <grahamperrin@gmail.com>
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42029


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# f74be55e 25-Apr-2023 Dimitry Andric <dim@FreeBSD.org>

vm: fix a number of functions to match the expected prototypes

Noticed while attempting to make boolean_t unsigned: some vm-related
function declarations and definitions were using boolean_t where they
should have used int, and vice versa.

MFC after: 1 week
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D39753


# 645510e6 09-Dec-2022 Konstantin Belousov <kib@FreeBSD.org>

Provide consistent prototype for swp_pager_meta_free()

This should fix 32bit build breakage.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# baa1ccce 26-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

Make swap_pager_freespace() global

also make it return the count of the swap pages freed, which are not
simultaneously resident in the object.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097


# 5bd45b2b 18-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

swap_pager_find_least(): assert that the function is called on the right object type

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37024


# 26eed2aa 13-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

swap_pager: style, wrap long lines

Reviewed by: brooks, imp (previous version)
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36540


# ccdaa1ab 13-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

vm_overcommit: put into __read_mostly section

Suggested by: mjg
Reviewed by: brooks, imp (previous version)
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36540


# a6cc4c6e 12-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

vm: make vm.overcommit available externally

Reviewed by: brooks, imp (previous version)
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36540


# 54291f7d 18-Jul-2022 Alan Cox <alc@FreeBSD.org>

swap_pager: Reduce the scope of the object lock in putpages

We don't need to hold the object lock while allocating swap space, so
don't.

Reviewed by: dougm, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35839


# fffc1c59 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm_object: Release object swap charge in the swap pager destructor

With the removal of OBJT_DEFAULT, we can simply handle this in
swap_pager_dealloc(). No functional change intended.

Suggested by: alc
Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35787


# cb6757c0 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

swap_pager: Remove handling for objects with OBJ_SWAP clear

With the removal of OBJT_DEFAULT, we can assume that pager operations
provide an object with OBJ_SWAP set. Also, we do not need to convert
objects from type OBJT_DEFAULT. Thus, remove checks for OBJ_SWAP and
remove code which modifies the object type. In some places, replace the
check for OBJ_SWAP with a check for whether any swap blocks are
assigned.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35786


# b8ebd99a 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm: Use __diagused for variables only used in KASSERT().


# 567378cc 10-Apr-2022 Enji Cooper <ngie@FreeBSD.org>

Fix OID format for `vm.swap_reserved` and `vm.swap_total`

The correct OID format for CTLTYPE_U64 is `QU` (`uquad_t`), not `A`
(text expressed via `char *`).

This issue was noticed while doing a sysctl tree walk using a
sysctl(9) consumer that relies on the OID format to intuit what the
type should be for a given sysctl.

MFC after: 1 month
Sponsored by: DellEMC Isilon
Differential Revision: https://reviews.freebsd.org/D34877


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 43b3b8e5 11-Jan-2022 Mark Johnston <markj@FreeBSD.org>

swap_pager: uma_zcreate() doesn't fail

Remove always-false checks for UMA zone creation failure. No functional
change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33809


# 53465702 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

swapoff: add one more variant of the syscall

Requested and reviewed by: brooks
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33343


# e8dc2ba2 29-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swapoff(2): add a SWAPOFF_FORCE flag

The flag requests skipping the heuristic which tries to avoid leaving
the system with more allocated memory than available from RAM and remaining
swap.

Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33165


# a4e4132f 29-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swapoff(2): replace special device name argument with a structure

For compatibility, add a placeholder pointer to the start of the
added struct swapoff_new_args, and use it to distinguish old vs. new
style of syscall invocation.

Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33165


# 6df35944 03-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

swap_pager.c: Remove MPSAFE and ARGSUSED annotations

Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33165


# 0190c38b 26-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swapoff_one(): only check free pages count when manually turning swap off

When swap is turned off due to system shutdown or reboot, ignore the
check. The problem is that the check is not accurate by any means: the free
page count can legitimately be low while the system is still able to page in
everything from the swap. Then, we turn swap off if swapping on a
real file or some non-standard geom provider, and typically panic
when the system appears to actually need an unavailable page.

For the syscall, it is better to be safe than sorry.

Reported and tested by: peterj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33147


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# b19740f4 24-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swap_pager: lock vnode in swapdev_strategy()

VOP_STRATEGY() requires locked vnode. Note that we lock the swap vnode
while pages are busy, but this would only cause real LoR if pages belong
to the swap vnode, which must not be the case for correct use.

Reported and tested by: peterj
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33119


# 6ddf41fa 23-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swapon: extend the region where the swap vnode is locked

to cover VOP_GETATTR() call in sys_swapon(). Move locking from inside
swapongeom() and swaponvp() into sys_swapon().

Reported by and tested by: peterj
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33119


# a6d04f34 23-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

swap pager: lock vnode around VOP_CLOSE()

Reported and tested by: peterj
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33119


# 183f8e1e 28-Sep-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Externalize nsw_cluster_max and initialize it early.

GEOM_ELI needs to know the value, because it will soon have special
memory handling for IO operations associated with swap.

Move initialization to swap_pager_init(), which is executed at
SI_SUB_VM, unlike swap_pager_swap_init(), which would be executed
only when a swap is configured. GEOM_ELI might need the value at
SI_SUB_DRIVERS, when disks are tasted by GEOM.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D24400


# c6213bef 28-Sep-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Add flag BIO_SWAP to mark IOs that are associated with swap.

Submitted by: jtl
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D24400


# 686aa928 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

swap_pager: Handle large swap_pager_reserve() requests

This interface is used solely by md(4) when the MD_RESERVE flag is
specified, as in `mdconfig -a -t swap -s 1G -o reserve`. It
pre-allocates swap blocks for the entire object.

The number of blocks to be reserved is specified as a vm_size_t, but
swp_pager_getswapspace() can allocate at most INT_MAX blocks. vm_size_t
also seems like the incorrect type to use here, since the value refers only
to the size of the VM object, not the size of a mapping. So:
- change the type of "size" in swap_pager_reserve() to vm_pindex_t, and
- clamp the requested number of blocks for a single
swp_pager_getswapspace() call to INT_MAX.

Reported by: syzkaller
Reviewed by: dougm, alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31875
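
The clamping described in this commit can be illustrated with a small userspace sketch. It is not the kernel code: toy_getswapspace() and toy_reserve() are hypothetical stand-ins, and the toy allocator is assumed to always grant the full request. The only point shown is that a 64-bit request is split into calls of at most INT_MAX blocks each.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Counts calls to the toy allocator (hypothetical, for illustration). */
static uint64_t toy_calls;

/*
 * Hypothetical stand-in for an allocator that takes an int block count,
 * so a single call can never cover more than INT_MAX blocks.
 * Returns 0 on success; this toy version always succeeds.
 */
static int
toy_getswapspace(int *npages)
{
	assert(*npages > 0);
	toy_calls++;
	return (0);
}

/*
 * Reserve "size" blocks, clamping each request to INT_MAX.
 * Returns the number of blocks actually reserved.
 */
static uint64_t
toy_reserve(uint64_t size)
{
	uint64_t done = 0;

	while (done < size) {
		uint64_t left = size - done;
		int n = left > INT_MAX ? INT_MAX : (int)left;

		if (toy_getswapspace(&n) != 0)
			break;
		done += (uint64_t)n;
	}
	return (done);
}
```

A request slightly larger than INT_MAX therefore takes two allocator calls in this sketch instead of overflowing the int argument.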


# 28bc23ab 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

tmpfs: dynamically register tmpfs pager

Remove OBJT_SWAP_TMPFS. Move tmpfs-specific swap pager bits into
tmpfs_subr.c.

There is no longer any code to directly support tmpfs in sys/vm, most
tmpfs knowledge is shared by non-anon swap object type implementation.
The tmpfs-specific methods are provided by registered tmpfs pager, which
inherits from the swap pager.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 00a3fe96 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object_kvme_type(): reimplement by embedding kvme_type into pagerops

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# 06d1fd9f 12-May-2021 Mark Johnston <markj@FreeBSD.org>

swap_pager: Zero swap info before exporting to userspace

Otherwise padding bytes are leaked.

Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# d474440a 03-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Constify vm_pager-related virtual tables.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 4b8365d7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add OBJT_SWAP_TMPFS pager

This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# a7c198a2 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Implement vm_object_vnode() using vm_pager_getvp()

Allow vp_heldp argument to be NULL, in which case the returned vnode
is not held for tmpfs swap objects.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 1390a5cb 01-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_freespace method

Makes the code in vm_object collapse/page_remove cleaner

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 192112b7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_getvp method

This eliminates the staircase of conditions in vm_map_entry_set_vnode_text().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# c23c555b 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_mightbedirty method

Used to implement vm_object_mightbedirty()

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 180bcaa4 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_pager: add pgo_set_writeable_dirty method

specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# a0850dd0 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

swappagerops: slightly more style-compliant formatting

Remove excess spaces from comments.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# ecfbddf0 14-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

sysctl vm.objects: report backing object and swap use

For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object. This allows userspace
to deconstruct the shadow chain. Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field. I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29771


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the one place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get private MAXPHYS-like constant; their conversion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
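
The reason for the extra b_pages[] slot can be checked with simple arithmetic. The sketch below is a userspace illustration, not kernel code: TOY_PAGE_SIZE (4 KiB) and toy_pages_spanned() are assumptions made for the example. It counts how many pages a buffer touches, given its start address and length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the illustration only. */
#define TOY_PAGE_SIZE 4096u

/*
 * Number of pages touched by the byte range [addr, addr + len).
 * An unaligned start can push the range into one extra page.
 */
static size_t
toy_pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	assert(len > 0);
	first = addr / TOY_PAGE_SIZE;
	last = (addr + len - 1) / TOY_PAGE_SIZE;
	return ((size_t)(last - first + 1));
}
```

With a 1 MiB buffer, a page-aligned start covers 256 pages, while a start at offset 512 covers 257, which matches the atop(maxphys) + 1 sizing described above.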


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# 00fd73d2 25-Jul-2020 Doug Moore <dougm@FreeBSD.org>

Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64. Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX. Call them both BLIST_RADIX.

Make improvements to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by: jmallett
Reviewed by: alc
Discussed with: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D25736


# ee744122 24-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: fix swap reservation leak and clean up surrounding code

The code did not subtract from the global counter if per-uid reservation
failed.

Cleanup highlights:
- load overcommit once
- move per-uid manipulation to dedicated routines
- don't fetch wire count if requested size is below the limit
- convert return type from int to bool
- ifdef the routines with _KERNEL to keep vm.h compilable by userspace

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D25787
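
The leak pattern fixed here can be sketched in a few lines. This is a simplified single-threaded model with hypothetical names (toy_swap_reserve(), toy_reserve_by_uid(), struct toy_uidinfo) and made-up limits; the kernel code uses atomics and different accounting. It only shows the two-step reservation and the rollback of the global counter when the per-uid step fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-uid accounting record. */
struct toy_uidinfo {
	uint64_t reserved;
	uint64_t limit;
};

/* Hypothetical global accounting; the limit is arbitrary. */
static uint64_t toy_global_reserved;
static uint64_t toy_global_limit = 1000;

/* Per-uid step, kept in a dedicated routine. */
static bool
toy_reserve_by_uid(struct toy_uidinfo *ui, uint64_t incr)
{
	if (ui->reserved + incr > ui->limit)
		return (false);
	ui->reserved += incr;
	return (true);
}

/* Reserve globally first, then per uid; undo the global step on failure. */
static bool
toy_swap_reserve(struct toy_uidinfo *ui, uint64_t incr)
{
	assert(ui != NULL);
	if (toy_global_reserved + incr > toy_global_limit)
		return (false);
	toy_global_reserved += incr;
	if (!toy_reserve_by_uid(ui, incr)) {
		/* Omitting this subtraction is the leak described above. */
		toy_global_reserved -= incr;
		return (false);
	}
	return (true);
}
```

After a failed per-uid reservation the global counter is unchanged, so repeated failures cannot slowly exhaust the global budget.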


# 126a2470 23-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: annotate swap_reserved with __exclusive_cache_line

The counter keeps being updated all the time and variables read afterwards
share the cacheline. Note this still fundamentally does not scale and needs
to be replaced; in the meantime it gets a bandaid.

brk1_processes -t 52 ops/s:
before: 8598298
after: 9098080


# 7ce3a312 09-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: rework swap_pager_status to execute in constant time

The lock-protected iteration is trivially avoidable.

This removes a serialisation point from Linux binaries (which end up calling
here from the sysinfo syscall).


# d869a17e 06-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23978


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 36b01270 23-Feb-2020 Doug Moore <dougm@FreeBSD.org>

The last argument to swp_pager_getswapspace is always 1. Remove that argument.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23810


# 7ca55392 23-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Allow swap_pager_putpages() to allocate one block at a time.

The minimum allocation size of 4 blocks is an old policy that came with
the "new" swap pager in r42957. Since then the blist allocator has
gotten better at reducing fragmentation; for example, with r349777 it
can return a range that spans multiple leaves. When swap space is close
to being exhausted, the minimum of 4 blocks most likely exacerbates
memory pressure, so reduce it to 1.

Reported by: alc
Tested by: pho
Reviewed by: alc, dougm, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23763


# 6c5f36ff 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this
flag.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23712


# 34e2051f 17-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Remove swblk_t.

It was used only to store the bounds of each swap device. However,
since swblk_t is a signed 32-bit int and daddr_t is a signed 64-bit
int, swp_pager_isondev() may return an invalid result if swap devices
are repeatedly added and removed and sw_end for a device ends up
becoming a negative number.

Note that the removed comment about maximum swap size still applies.

Reviewed by: jeff, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23666
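
The failure mode can be reproduced with plain integers. The sketch below is illustrative only: the struct and function names are hypothetical, and the bounds check is a simplified version of "block lies within [first, end)". It shows how an end offset that no longer fits in 32 signed bits turns negative and makes the range check reject valid blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Keep the low 32 bits of v and reinterpret them as a signed value. */
static int32_t
toy_truncate32(int64_t v)
{
	uint32_t low = (uint32_t)v;	/* conversion is modulo 2^32 */

	if (low <= INT32_MAX)
		return ((int32_t)low);
	return ((int32_t)((int64_t)low - 4294967296LL));
}

/* Hypothetical device bounds stored in 32-bit and 64-bit fields. */
struct toy_dev32 {
	int32_t first;
	int32_t end;
};

struct toy_dev64 {
	int64_t first;
	int64_t end;
};

static bool
toy_isondev32(const struct toy_dev32 *d, int64_t blk)
{
	assert(d != 0);
	return (blk >= d->first && blk < d->end);
}

static bool
toy_isondev64(const struct toy_dev64 *d, int64_t blk)
{
	assert(d != 0);
	return (blk >= d->first && blk < d->end);
}
```

With bounds stored as 64-bit values the same check gives the expected answer, which is the effect of dropping the narrower type.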


# 725b4ff0 17-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Fix a swap block allocation race.

putpages' allocation of swap blocks is done under the global sw_dev
lock. Previously it would drop that lock before inserting the allocated
blocks into the object's trie, creating a window in which swap blocks
are allocated but are not visible to swapoff. This can cause
swp_pager_strategy() to fail and panic the system.

Fix the problem bluntly, by allocating swap blocks under the object
lock.

Reviewed by: jeff, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23665


# c90d075b 17-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Fix object locking races in swapoff(2).

swap_pager_swapoff_object()'s goal is to allocate pages for all valid
swap blocks belonging to the object, for which there is no resident
page. If the page corresponding to a block is already resident and
valid, the block can simply be discarded.

The existing implementation tries to minimize the number of I/Os used.
For each cluster of swap blocks, it finds maximal runs of valid swap
blocks not resident in memory, and valid resident pages. During this
processing, the object lock may be dropped in several places: when
calling getpages, or when blocking on a busy page in
vm_page_grab_pages(). While the lock is dropped, another thread may
free swap blocks, causing getpages to page in stale data.

Fix the problem following a suggestion from Jeff: use getpages'
readahead capability to perform clustering rather than doing it
ourselves. This simplifies the code a bit without reintroducing the old
behaviour of performing one I/O per page.

Reviewed by: jeff
Reported by: dhw, gallatin
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23664


# d6e13f3b 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't hold the object lock while calling getpages.

The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033


# 98087a06 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Make collapse synchronization more explicit and allow it to complete during
paging.

Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object. This gives us an explicit test rather than overloading
paging-in-progress. While split is on-going we mark an object with SPLIT.
These two operations will modify the swap tree so they must be serialized
and swap_pager_getpages() can now directly detect these conditions and page
more conservatively.

Callers to vm_object_collapse() now will reliably wait for a collapse to finish
so that the backing chain is as short as possible before other decisions are
made that may inflate the object chain. For example, split, coalesce, etc.
It is now safe to run fault concurrently with collapse. It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.

This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.

This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# a8081778 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
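
The snapshot-and-cmpset pattern mentioned above can be sketched with C11 atomics. This is a userspace illustration under stated assumptions: union toy_astate and its field layout are modeled loosely on the description (a 32-bit structure holding queue state) and are not claimed to match the kernel definition; toy_page_set_queue() is a hypothetical helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy 32-bit state word; the field layout is an assumption. */
union toy_astate {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	} s;
	uint32_t bits;
};

static_assert(sizeof(union toy_astate) == sizeof(uint32_t),
    "state must fit in a single 32-bit word");

/*
 * Change the queue field without a lock: take a snapshot on the stack,
 * modify the copy, and retry the compare-and-swap until no other
 * updater has changed the word in between.
 */
static union toy_astate
toy_page_set_queue(_Atomic uint32_t *state, uint8_t queue)
{
	union toy_astate old, next;

	old.bits = atomic_load(state);
	do {
		next = old;
		next.s.queue = queue;
		/* On failure, old.bits is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(state, &old.bits, next.bits));
	return (next);
}
```

Because the whole state fits in one word, the other fields (flags and act_count in this sketch) are carried over unchanged by the same atomic update.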


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# 67388836 01-Dec-2019 Konstantin Belousov <kib@FreeBSD.org>

Store the bottom of the shadow chain in OBJ_ANON object->handle member.

The handle value is stable for all shadow objects in the inheritance
chain. This avoids descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminates the
corresponding object relocking which appeared to be contending.

Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to set up the handle.

Reported by: jeff
Reviewed by: alc (previous version), jeff (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22541


# 63967687 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
redundant complicated checks and additional locking required only for
anonymous memory. Introduce vm_object_allocate_anon() to create these
objects. DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by: alc, dougm, kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22119


# 8ecbf14b 19-Nov-2019 Doug Moore <dougm@FreeBSD.org>

Drop the extra argument from swp_pager_meta_ctl and have it do lookup
only. Rename it swp_pager_meta_lookup. Stop checking for obj->type
== swap there and assert it instead. Make the caller responsible for
the obj->type check.

Move the meta_ctl 'pop' functionality to swap_pager_unswapped, the
only place that uses it, and assume obj->type == swap there too.

Assisted by: ota_j.email.ne.jp
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22437


# abdab7b6 17-Nov-2019 Doug Moore <dougm@FreeBSD.org>

Add a helper function for testing a swap block and freeing it if empty.

Submitted by: ota_j.email.ne.jp
Approved by: alc, kib, dougm
Differential Revision: https://reviews.freebsd.org/D22402


# 467057fc 11-Nov-2019 Doug Moore <dougm@FreeBSD.org>

swap_pager_meta_free() frees allocated blocks in a way that
exploits the sparsity of allocated blocks in a range, without
issuing an "are you there?" query for every block in the range.
swap_pager_copy() is not so smart. Modify the implementation
of swap_pager_meta_free() slightly so that swap_pager_copy()
can use that smarter implementation too.

Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp)
Reviewed by: kib,alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22280
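
A minimal sketch of the sparsity idea described above, not the kernel
code. Allocated indices are modeled as a sorted array, and
sketch_lookup_ge() and sketch_free_range() are hypothetical names. The
loop jumps to the first allocated index at or after the start of the
range and stops at the first one past the end, so unallocated indices
are never queried.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first key >= start in a sorted array, or n if none. */
size_t
sketch_lookup_ge(const uint64_t *keys, size_t n, uint64_t start)
{
	size_t lo = 0, hi = n;

	assert(keys != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < start)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/*
 * Count the allocated keys in [start, start + count), visiting only keys
 * that exist rather than probing every index in the range.
 */
size_t
sketch_free_range(const uint64_t *keys, size_t n, uint64_t start,
    uint64_t count)
{
	size_t freed = 0;

	for (size_t i = sketch_lookup_ge(keys, n, start);
	    i < n && keys[i] - start < count; i++)
		freed++;
	return (freed);
}
```

With five allocated indices spread over a range of 90000, a call
covering 6000 indices touches only the three that fall inside it.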


# 303fa05a 17-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

swapon_check_swzone(): use already calculated static variables.

Submitted by: ota@j.email.ne.jp
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22065


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
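
A minimal sketch of the iteration-macro pattern, assuming a circular
list whose header doubles as the sentinel. The toy_* names and
TOY_MAP_ENTRY_FOREACH are hypothetical stand-ins, not the vm_map
definitions. Callers use the macro only, so a change of the underlying
structure would be confined to the macro body.

```c
#include <assert.h>
#include <stddef.h>

struct toy_entry {
	struct toy_entry *next;
	int value;
};

/* A toy map whose header acts as the list sentinel. */
struct toy_map {
	struct toy_entry header;
};

/* Visit every entry; callers never follow the next pointers themselves. */
#define TOY_MAP_ENTRY_FOREACH(it, map)			\
	for ((it) = (map)->header.next;			\
	    (it) != &(map)->header;			\
	    (it) = (it)->next)

/* Example consumer: sum the values of all entries in the map. */
int
toy_map_sum(struct toy_map *map)
{
	struct toy_entry *it;
	int sum = 0;

	assert(map != NULL);
	TOY_MAP_ENTRY_FOREACH(it, map)
		sum += it->value;
	return (sum);
}
```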


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# fe7bcbaf 03-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

vm pager: writemapping accounting for OBJT_SWAP

Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456


# cf27e0d1 19-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by: markj
Tested by: pho (as part of a larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21311


# 504f5e29 14-Aug-2019 Doug Moore <dougm@FreeBSD.org>

swap_pager.c reserves 2 blocks for a bsd label. Change that 2 to the
expression howmany(BBSIZE, PAGE_SIZE), where BBSIZE is the size of the
boot block area. That can be less than 2 if PAGE_SIZE is big.

swapon(8) has an option to trim (delete) all the blocks of a device at
startup. However, if the first of those blocks is a bsd label, then
trimming those blocks is destructive. Change swapon to leave the
first BBSIZE bytes untrimmed.

Update manual pages to reflect changes in swapon and how it may be
used, especially in association with savecore.

Reviewed by: alc
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21191
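
The arithmetic behind the reservation can be sketched as follows. The
round-up division matches the usual howmany() definition; the value
8192 for the boot block area is an assumption made for illustration,
and reserved_blocks() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Round-up integer division, in the style of howmany(). */
#define HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/* Assumed size of the boot block area, in bytes. */
#define SKETCH_BBSIZE	8192

/* Number of page-sized swap blocks needed to cover the boot block area. */
unsigned
reserved_blocks(unsigned page_size)
{
	assert(page_size > 0);
	return (HOWMANY(SKETCH_BBSIZE, page_size));
}
```

Under that assumption a 4096-byte page gives 2 reserved blocks, while a
16384-byte page gives 1, which is the "less than 2" case mentioned in
the message.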


# 23612f0d 28-Jul-2019 Doug Moore <dougm@FreeBSD.org>

In swap_pager_putpages, move the initialization of a free-blocks
counter, and the final freeing of freed swap blocks, outside the
region where an object lock is held. Correct some style(9) and
spelling errors. Change a panic() to a KASSERT(). Change a boolean_t
to a bool.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D21093


# 7b9bcad9 07-Jul-2019 Doug Moore <dougm@FreeBSD.org>

A style-related change, r349791, made unclear the meaning of a
comment. Rewrite that comment to improve its clarity.

Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871


# 0cab71bc 06-Jul-2019 Doug Moore <dougm@FreeBSD.org>

Fix style(9) violations involving division by PAGE_SIZE.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847


# 31c82722 06-Jul-2019 Doug Moore <dougm@FreeBSD.org>

Change blist_next_leaf_alloc so that it can examine more than one leaf
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.

Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579


# 56948d17 05-Jul-2019 Doug Moore <dougm@FreeBSD.org>

Based on work posted at https://reviews.freebsd.org/D13484, change
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time to a swap device, rather than doing
one I/O operation for each page.

Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635


# 7c022327 08-Jun-2019 Doug Moore <dougm@FreeBSD.org>

Simple code refactoring originally in D13484.

Extract swp_pager_force_dirty() and swp_pager_force_launder() out of
swp_pager_force_pagein().

Extract swap_pager_swapoff_object() out of swap_pager_swapoff().

Submitted by: ota_j.email.ne.jp
Reviewed by: alc, dougm
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20545


# 21d77284 03-Jun-2019 Konstantin Belousov <kib@FreeBSD.org>

Remove dead store.

sw_flags is set to the function argument several lines later.

Reported by: danfe using PVS-studio
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d842aa51 01-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a vm_page_wired() predicate.

Use it instead of accessing the wire_count field directly. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# 87ae0686 11-May-2019 Doug Moore <dougm@FreeBSD.org>

A new parameter to blist_alloc specifies an upper bound on the size of
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.

I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
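
A minimal sketch of the retry policy described above, using a toy model
in place of the blist allocator. Free space is an array of run lengths,
and toy_alloc() and toy_getswapspace() are hypothetical names. After a
request for at least n blocks fails, the next request asks for between
n/2 and n-1 blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy free-space model: lengths of the free runs, in search order. */
struct toy_space {
	const int *runs;
	size_t nruns;
};

/*
 * Grant blocks from the first run holding at least min_blocks, capped at
 * max_blocks. Returns the size granted, or 0 if no run is big enough.
 */
int
toy_alloc(const struct toy_space *sp, int min_blocks, int max_blocks)
{
	assert(min_blocks >= 1 && min_blocks <= max_blocks);
	for (size_t i = 0; i < sp->nruns; i++) {
		if (sp->runs[i] >= min_blocks)
			return (sp->runs[i] < max_blocks ?
			    sp->runs[i] : max_blocks);
	}
	return (0);
}

/* Ask for want blocks; on failure halve the minimum and retry. */
int
toy_getswapspace(const struct toy_space *sp, int want)
{
	int max_blocks = want, min_blocks = want, got;

	while (min_blocks >= 1) {
		got = toy_alloc(sp, min_blocks, max_blocks);
		if (got != 0)
			return (got);
		/* The next range is [min/2, min - 1]. */
		max_blocks = min_blocks - 1;
		min_blocks >>= 1;
	}
	return (0);
}
```

With free runs of 5, 19 and 27 blocks, a request for 32 fails, the
retry asks for 16 to 31, and the caller receives 19 rather than 16.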


# 48e98a2a 11-May-2019 Doug Moore <dougm@FreeBSD.org>

Callers of swp_pager_getswapspace get either as many blocks as they
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228


# 0b208315 26-Mar-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Improve error reporting when the swap pager runs out of memory.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D19699


# f6d281e8 10-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

struct xswdev on amd64 requires compat32 shims after ino64.

i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
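
A minimal sketch of why such a shim is needed. The sketch_rec
structures are hypothetical and do not reproduce the real struct xswdev
fields; the pack pragma is used here only to imitate 4-byte alignment
of uint64_t on a host that normally aligns it to 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a 32-bit field followed by a 64-bit field. */
struct sketch_rec {
	uint32_t version;
	uint64_t dev;
};

/* The same fields with uint64_t aligned to 4 bytes, imitating i386. */
#pragma pack(push, 4)
struct sketch_rec32 {
	uint32_t version;
	uint64_t dev;
};
#pragma pack(pop)

/* Copy a native record into the 32-bit layout, field by field. */
void
sketch_to_rec32(const struct sketch_rec *in, struct sketch_rec32 *out)
{
	assert(in != NULL && out != NULL);
	out->version = in->version;
	out->dev = in->dev;
}
```

In the 4-byte-aligned layout the 64-bit field sits at offset 4 and the
structure is 12 bytes long, so a 64-bit kernel cannot hand its native
layout to a 32-bit consumer without converting it.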


# 756a5412 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.

o In vm_pager_bufferinit() create pbuf_zone and start accounting for how many
pbufs we are going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.

Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.

The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with: gallatin
Tested by: pho


# a8233027 01-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Allow to create swap zone larger than v_page_count / 2.

If user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt. Before, it was only
possible to reduce the zone by the tunable.

While there, correct the message which was not correct when zone
creation rounded the size up.

Reported by: jmg
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18381


# 541a1175 19-Nov-2018 Alan Cox <alc@FreeBSD.org>

Use swp_pager_isondev() throughout. Submitted by: ota@j.email.ne.jp

Change swp_pager_isondev()'s return type to bool.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16712


# 150d384e 07-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Fix a use-after-free in swp_pager_meta_free().

This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934. In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed. After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.

Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17897


# e8bb589d 04-Oct-2018 Matt Macy <mmacy@FreeBSD.org>

eliminate locking surrounding ui_vmsize and swap reserve by using atomics

Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
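
A minimal sketch of lock-free reservation accounting in units of pages,
written with C11 atomics rather than the kernel's atomic(9) primitives.
The sketch_* names and the limit of 1024 pages are assumptions for
illustration. The counter is bumped optimistically and the addition is
undone if the limit would be exceeded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical global counters, both in units of pages. */
static _Atomic uint64_t sketch_reserved;
static uint64_t sketch_total = 1024;

/* Try to reserve pages; returns false if the limit would be exceeded. */
bool
sketch_reserve(uint64_t pages)
{
	uint64_t prev;

	prev = atomic_fetch_add(&sketch_reserved, pages);
	if (prev + pages <= sketch_total)
		return (true);
	/* Over the limit: undo our contribution and fail. */
	atomic_fetch_sub(&sketch_reserved, pages);
	return (false);
}

/* Give back a reservation made earlier. */
void
sketch_release(uint64_t pages)
{
	uint64_t prev;

	prev = atomic_fetch_sub(&sketch_reserved, pages);
	assert(prev >= pages);
	(void)prev;
}
```

No mutex is taken on either path. A concurrent caller may briefly see
the counter above the limit between the add and the undo, and so may be
refused conservatively.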


# f5fbe90d 24-Sep-2018 Alan Cox <alc@FreeBSD.org>

Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)

Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296


# 78f1deef 07-Aug-2018 Alan Cox <alc@FreeBSD.org>

Defer and aggregate swap_pager_meta_build frees.

Before swp_pager_meta_build replaces an old swapblk with a new one,
it frees the old one. To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.

Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj (an earlier version)
Tested by: pho
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13707


# b7b8a096 14-Jun-2018 Konstantin Belousov <kib@FreeBSD.org>

Handle the race between fork/vm_object_split() and faults.

If fault started before vmspace_fork() locked the map, and then during
fork, vm_map_copy_entry()->vm_object_split() is executed, it is
possible that the fault instantiates the page into the original object
when the page was already copied into the new object (see
vm_map_split() for the orig/new objects terminology). This can happen
if split found a busy page (e.g. from the fault) and slept dropping
the objects lock, which allows the swap pager to instantiate
read-behind pages for the fault. Then the restart of the scan can see
a page in the scanned range, where it was already copied to the upper
object.

Fix it by instantiating the read-ahead pages before
swap_pager_getpages() method drops the lock to allocate pbuf. The
object scan would see the whole range prefilled with the busy pages
and not proceed the range.

Note that vm_fault rechecks the map generation count after the object
unlock, so that it restarts the handling if raced with split, and
re-lookups the right page from the upper object.

In collaboration with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 3f060b60 16-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Use the conventional name for an array of pages.

No functional change intended.

Discussed with: kib
MFC after: 3 days


# e958ad4c 12-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 4abca9bb 30-Dec-2017 Alan Cox <alc@FreeBSD.org>

Previously, swap_pager_copy() freed swap blocks one at a time, via
swp_pager_meta_ctl(), with no opportunity to recognize freeing of
consecutive blocks and free fewer block ranges. To open that opportunity,
this change removes the SWM_FREE option from swp_pager_meta_ctl(), and
compels the caller to do the freeing when a valid block address is returned.
In swap_pager_copy(), these frees are aggregated, so that a sequence of them
can be done at one time.

The only other caller to swp_pager_meta_ctl() that passed SWM_FREE,
swp_pager_unswapped(), is also modified to handle its single free
explicitly.

Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13290


# 230869e0 28-Nov-2017 Alan Cox <alc@FreeBSD.org>

When the swap pager allocates space on disk, it requests contiguous
blocks in a single call to blist_alloc(). However, when it frees
that space, it previously called blist_free() on each block, one at a
time. With this change, the swap pager identifies ranges of
contiguous blocks to be freed, and calls blist_free() once per
range. In one extreme case, that is described in the review, the time
to perform an munmap(2) was reduced by 55%.

Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12397
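
A minimal sketch of the range aggregation described above. The
blk_range structure and coalesce_frees() are hypothetical names; the
real code hands each range to blist_free() instead of storing it. A
block that directly follows the current range extends it, and any other
block closes the range and starts a new one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct blk_range {
	int64_t start;
	int64_t count;
};

/*
 * Merge block numbers, taken in the order they are released, into
 * contiguous ranges. Returns the number of ranges written to out[],
 * which must have room for n entries.
 */
size_t
coalesce_frees(const int64_t *blks, size_t n, struct blk_range *out)
{
	size_t nranges = 0;

	assert(blks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (nranges > 0 && out[nranges - 1].start +
		    out[nranges - 1].count == blks[i]) {
			/* Contiguous with the current range: extend it. */
			out[nranges - 1].count++;
		} else {
			/* Gap: start a new range at this block. */
			out[nranges].start = blks[i];
			out[nranges].count = 1;
			nranges++;
		}
	}
	return (nranges);
}
```

Six released blocks 10, 11, 12, 40, 41 and 7 collapse into three
ranges, so three free calls replace six.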


# df57947f 18-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

spdx: initial adoption of licensing ID tags.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace many instances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# be7d4ac5 22-Oct-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add OID for the vm.overcommit sysctl. This makes it possible to remove
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12745


# 1fffcd75 18-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not report reduction of swap zone if it was not.

After r324600 we see the actual reservation.

Reported by: jkim
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 53faf5a7 13-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Evaluate the real size of the sblk_zone.

Submitted by: ota@j.email.ne.jp
PR: 221356
Reviewed by: alc, markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D12660


# 37244a84 08-Oct-2017 Alan Cox <alc@FreeBSD.org>

Replace an unnecessary call to vm_page_activate() by an assertion that
the page is already wired or queued. Prior to the elimination of PG_CACHED
pages, vm_page_grab() might have returned a valid, previously PG_CACHED
page, in which case enqueueing the page was necessary. Now, that can't
happen. Moreover, activating the page is a dubious choice, since the page
is not being accessed.

Reviewed by: kib
MFC after: 1 week


# 41e5a226 01-Oct-2017 Alan Cox <alc@FreeBSD.org>

When an I/O error occurs on page out, there is no need to dirty the page,
because it is already dirty. Instead, assert that the page is dirty.

Reviewed by: kib, markj
MFC after: 1 week


# d027ed2e 10-Sep-2017 Alan Cox <alc@FreeBSD.org>

To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks. The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers. The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.

The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.

Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11906
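
A minimal sketch of the lone-bit computation mentioned above, with
uint64_t standing in for u_daddr_t. The function names are
hypothetical. The first version is the generic binary search; the
second uses the GCC/Clang count-trailing-zeros builtin where it is
available and falls back to the search otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the single set bit in mask, found by binary search. */
int
bitpos_search(uint64_t mask)
{
	int lo = 0, hi = 64, mid;

	assert(mask != 0 && (mask & (mask - 1)) == 0);
	while (lo + 1 < hi) {
		mid = (lo + hi) >> 1;
		if ((mask >> mid) != 0)
			lo = mid;	/* the bit is at mid or above */
		else
			hi = mid;	/* the bit is below mid */
	}
	return (lo);
}

/* Same result via a compiler builtin when one is available. */
int
bitpos_builtin(uint64_t mask)
{
	assert(mask != 0 && (mask & (mask - 1)) == 0);
#if defined(__GNUC__) || defined(__clang__)
	return (__builtin_ctzll(mask));
#else
	return (bitpos_search(mask));
#endif
}
```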


# 85d88d87 06-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not leak empty swblk.

In swp_pager_meta_build(), if the requested operation results in
freeing the last swap pointer in the swblk, free the trie node. Other
swap pager code does not expect to find completely empty swblk.

Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# eed99cb8 06-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

In swp_pager_meta_build(), handle a race with other thread allocating
swapblk for our index while we dropped the object lock.

Noted by: jeff
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 35872e79 30-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Adjust interface of swapon_check_swzone() to its actual usage.

The function return value is not used. Its argument is always
swap_total/PAGE_SIZE, so make it not take any arguments.

Submitted by: ota@j.email.ne.jp
PR: 221356
MFC after: 1 week


# f08b3099 30-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Make the swap_pager_full variable static.

r290920 removed the use of the variable from vm/vm_pageout.c.

Submitted by: ota@j.email.ne.jp
PR: 221356
MFC after: 1 week


# ee620ea4 28-Aug-2017 Alan Cox <alc@FreeBSD.org>

Update a couple vm_object lock assertions in the swap pager to reflect the
new use of the vm_object's lock to synchronize updates to a radix trie
mapping per-vm object page indices to on-disk swap blocks.

Fix a typo in a nearby comment.

Reviewed by: kib, markj
X-MFC with: r322913
Differential Revision: https://reviews.freebsd.org/D12134


# f425ab8e 25-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

- The global swhash_mtx is removed, trie is synchronized by the
corresponding object lock.
- The swp_pager_meta_free_all() function used during object
termination is optimized by only looking at the trie instead of
having to search whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash,
global object list has to be inspected. There, we have to ensure
that we do see valid trie content if we see that the object type is
swap.
Sizing of the swblk zone is same as for swblock zone, each swblk maps
SWAP_META_PAGES pages.

Proposed by: alc
Reviewed by: alc, markj (previous version)
Tested by: alc, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D11435
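
A minimal sketch of how a page index might map onto a swblk in such a
per-object trie. It assumes the trie is keyed by the page index rounded
down to a swblk boundary; the value 16 for pages per swblk and the
sketch_* names are illustrative assumptions, not the kernel's
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of pages covered by one swblk; illustrative value. */
#define SKETCH_META_PAGES	16

/* Trie key: the page index rounded down to a swblk boundary. */
uint64_t
sketch_swblk_key(uint64_t pindex)
{
	return (pindex - pindex % SKETCH_META_PAGES);
}

/* Slot of the page within its swblk. */
unsigned
sketch_swblk_slot(uint64_t pindex)
{
	unsigned slot = (unsigned)(pindex % SKETCH_META_PAGES);

	assert(slot < SKETCH_META_PAGES);
	return (slot);
}
```

Under these assumptions page index 37 lives in the swblk keyed by 32,
at slot 5, and all lookups for an object touch only that object's trie.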


# 9680bb98 19-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Remove unused function swap_pager_isswapped().

Noted by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e2241590 24-Jun-2017 Alan Cox <alc@FreeBSD.org>

Increase the pageout cluster size to 32 pages.

Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the size of a hash table entry at its current
size.

Eliminate a pointless macro.

Reviewed by: kib, markj (an earlier version)
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D11305


# 3a5d839e 20-Jun-2017 Alan Cox <alc@FreeBSD.org>

Eliminate an unused macro.

MFC after: 3 days


# 87b0ab69 16-Jun-2017 Alan Cox <alc@FreeBSD.org>

Pages that are passed to swap_pager_putpages() should already be fully
dirty. Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.

Reviewed by: kib, markj
MFC after: 1 week
X-MFC after: r319932


# 761097c8 06-Jun-2017 Alan Cox <alc@FreeBSD.org>

Starting in r118390, swaponsomething() began to reserve the blocks at the
beginning of a swap area for a disk label. However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.

Reviewed by: kib
MFC after: 5 days


# 03bdd65f 05-Jun-2017 Alan Cox <alc@FreeBSD.org>

When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043


# 064650c1 05-Jun-2017 Alan Cox <alc@FreeBSD.org>

Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and
r96851.)

Reviewed by: kib, markj
Tested by: pho
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11028


# 07c348ea 27-May-2017 Alan Cox <alc@FreeBSD.org>

After r118390, the variable "dmmax" was neither the correct stripe size
nor the correct maximum block size. Moreover, after r318995, it serves
no purpose except to provide information to user space through a
read-only sysctl.

This change eliminates the variable "dmmax" but retains the sysctl. It
also corrects the value returned by the sysctl.

Reviewed by: kib, markj
MFC after: 3 days


# fe71561a 27-May-2017 Alan Cox <alc@FreeBSD.org>

In r118390, the swap pager's approach to striping swap allocation over
multiple devices was changed. However, swapoff_one() was not fully and
correctly converted. In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation. Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().

This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER. Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.

Reviewed by: kib, markj
MFC after: 3 days


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
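
The early counter mechanism described in 83c9dea1 can be illustrated with
a minimal userspace sketch. It is not the kernel code: the names
early_dummy_counter, struct vm_stats, stats_early_init() and
stats_late_init() are hypothetical, calloc() stands in for
counter_u64_alloc(), and a counter is modelled as a plain pointer rather
than per-CPU storage. In this sketch anything counted before the late
init stays in the dummy slot; the commit message does not say how the
kernel treats those early counts.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the spare slot pc_early_dummy_counter. */
static uint64_t early_dummy_counter;

/* In this sketch a counter is simply a pointer to its 64-bit storage. */
typedef uint64_t *counter_t;

struct vm_stats {
	counter_t v_swappgsin;
	counter_t v_swappgsout;
};

/* Before the allocator is usable, every field aliases the dummy slot. */
static void
stats_early_init(struct vm_stats *s)
{
	s->v_swappgsin = &early_dummy_counter;
	s->v_swappgsout = &early_dummy_counter;
}

/* Once allocation works, give each field its own zeroed storage. */
static int
stats_late_init(struct vm_stats *s)
{
	uint64_t *in = calloc(1, sizeof(*in));
	uint64_t *out = calloc(1, sizeof(*out));

	if (in == NULL || out == NULL) {
		free(in);
		free(out);
		return (-1);
	}
	s->v_swappgsin = in;
	s->v_swappgsout = out;
	return (0);
}

/* Writers never need to know which phase of boot they are in. */
static void
counter_add(counter_t c, uint64_t n)
{
	*c += n;
}

static uint64_t
counter_fetch(counter_t c)
{
	return (*c);
}
```

The point of the design is that counter_add() is safe to call at any
time: the pointer is always valid, only its target changes.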


# b1fd102e 02-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Add a page queue for holding dirty anonymous unswappable pages.

On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by: alc


# 2e56b64f 24-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix argument type and microoptimize swp_pager_meta_free().

The natural type of the count argument is vm_pindex_t, but due to the loop
organization, it has to be signed type to detect the termination
condition. Replace this logic by using distinguished counter for the
processed pages, and terminate loop when the counter exceeds the
argument.

Completely process one swblock for all relevant indexes instead of
doing relookup in hash when incrementing page index on the loop step.

Do not drop hash mutex around iterations.

Noted and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
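
The loop reorganization described in 2e56b64f can be sketched as follows.
This is an illustration only: pindex_t is a stand-in for vm_pindex_t,
and meta_free_range() with its flat slots[] table is hypothetical (the
real function walks swblocks in a hash). The sketch shows an unsigned
count argument paired with a separate counter of processed entries, so
no signed type is needed to detect termination.

```c
#include <stdint.h>

typedef uint64_t pindex_t;	/* stand-in for vm_pindex_t (unsigned) */

/*
 * Clear up to `count` slots starting at `start` in a table of `nslots`
 * entries. Returns how many of those slots were actually in use.
 */
static pindex_t
meta_free_range(uint8_t *slots, pindex_t nslots, pindex_t start,
    pindex_t count)
{
	pindex_t done, freed;

	if (start >= nslots)
		return (0);
	/* Clamp so that start + done can never run past the table. */
	if (count > nslots - start)
		count = nslots - start;

	freed = 0;
	/* `done` counts processed entries; stop once it reaches `count`. */
	for (done = 0; done < count; done++) {
		if (slots[start + done] != 0) {
			slots[start + done] = 0;
			freed++;
		}
	}
	return (freed);
}
```

Counting upward in a distinct variable avoids the decrement-below-zero
test that forced the original argument to be signed.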


# 77d6fd97 18-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Improve vm_object_scan_all_shadowed() to also check swap backing objects.

As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index. Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.

Testing shows that the number of new successful scans enabled by this
addition, is small but non-zero. When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 71057cd2 16-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

In swp_pager_meta_free_all(), fix type of the index variable. Style.

Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# bba39b9a 22-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# ebcddc72 09-Nov-2016 Alan Cox <alc@FreeBSD.org>

Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specifically, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302
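
The request latching described in ebcddc72 can be sketched as a tiny
priority rule. The enum values and the helpers latch_request() and
must_preempt() are hypothetical; the commit message only states that a
request can be latched while another is being serviced and that a
shortfall request preempts background laundering. Encoding that as a
numeric ordering is an assumption made for illustration.

```c
/* Hypothetical request kinds, ordered by assumed urgency. */
enum launder_req {
	REQ_NONE = 0,
	REQ_BACKGROUND = 1,
	REQ_SHORTFALL = 2
};

/* Latch a new request; the more urgent of the two stays pending. */
static enum launder_req
latch_request(enum launder_req pending, enum launder_req incoming)
{
	return (incoming > pending ? incoming : pending);
}

/* A latched shortfall request interrupts background laundering. */
static int
must_preempt(enum launder_req in_service, enum launder_req pending)
{
	return (in_service == REQ_BACKGROUND && pending == REQ_SHORTFALL);
}
```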


# dd9cb6da 03-Sep-2016 Mark Johnston <markj@FreeBSD.org>

Respect the caller's hints when performing swap readahead.

The pager getpages interface allows the caller to bound the number of
readahead and readbehind pages, and vm_fault_hold() makes use of this
feature. These bounds were ignored after r305056, causing the swap pager
to potentially page in more than the specified number of pages.

Reported and reviewed by: alc
X-MFC with: r305056


# 98150664 31-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Make swapoff reliable.

The swap_pager_swapoff() function uses trylock for the object lock
before pagein, which means that either i/o to md(4) over swap, or
intensive page faults over swap pager objects might prevent swapoff()
from making any progress. Then the retry < 100 check fails and machine
panics.

If trylock fails, acquire the object lock in the blockable way and
restart the hash bucket walk. Keep retries logic for now.

Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7688


# 915d1b71 29-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Restore swap pager readahead after r292373.

The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.

Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7677


# 0c657d22 03-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Explain why swapgeom_close_ev() is delegated.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 88ad2d7b 28-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Do not delegate work to the geom event thread that can be done inline.

In particular, swapongeom_ev() needed event thread context when swap
pager configuration was performed under Giant and geom asserted that
Giant is not owned. Now both of these reasons have gone away.

On the other hand, note that swapgeom_release() is called from the
bio_done context, and a possible close cannot be performed inline.

Also fix some minor issues. The swapgeom() function does not use the
td argument, remove it. Recheck that the vnode passed is still VCHR
and not reclaimed after the lock.

Reviewed by: mav
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 2174a0c6 28-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix style and typo.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# eb4d6a1b 12-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix inconsistent locking of the swap pager named objects list.

Right now, all modifications of the list are locked by sw_alloc_mtx.
But initial lookup of the object by the handle in swap_pager_alloc()
is not protected by sw_alloc_mtx, which means that
vm_pager_object_lookup() could follow freed pointer.

Create a new named swap object with the OBJT_SWAP type, instead
of OBJT_DEFAULT. With this change, swp_pager_meta_build() never needs
to upgrade a named OBJT_DEFAULT object to OBJT_SWAP (elsewhere, we do
not forbid client code from creating named OBJT_DEFAULT objects at
all).

That change allows sw_alloc_mtx to be removed and the list to be locked
by the sw_alloc_sx lock. Update swap_pager_copy() to the new locking mode.

Create helper swap_pager_alloc_init() to consolidate named and
anonymous swap object creation, while the caller ensures that the
necessary locks are held around the helper.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)


# 15719273 12-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Explicitly initialize sw_alloc_sx. Currently it is not initialized
but works due to zeroed out bss on startup.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)


# 9a204708 24-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove Giant around allocation of the swap pager with non-NULL handle.
Existing issue of not protecting pager_object_list iteration in
vm_pager_object_lookup() by sw_alloc_mtx is not affected by Giant
removal.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 4c36e917 22-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Mark swap-related proc sysctls as not requiring Giant.

Reviewed by: alc (as part of larger patch)
Sponsored by: The FreeBSD Foundation


# 04533e1e 22-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Replace hand-made exclusive lock, protecting against parallel
swapon/swapoff invocations, with sx.

Reviewed by: alc (as part of larger patch)
Sponsored by: The FreeBSD Foundation


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that the request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strengthening the promises on the busy state of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# d635a37f 05-Oct-2015 Warner Losh <imp@FreeBSD.org>

Mark swap_pager_putpages static at its definition. It was already
static at its declaration. Remove needless swapdev_strategy forward
declaration.

MFC After: 3 days


# 9e3e3fe5 08-Sep-2015 Warner Losh <imp@FreeBSD.org>

The swap pager is compatible with direct dispatch. It does its own
locking and doesn't sleep. Flag the consumer we create as such. In
addition, decrement the in flight index when we have an out of memory
error after having incremented it previously. This would have
prevented swapoff from working if the swap pager ever hit a resource
shortage trying to swap out something (the swap in path always waits
for a bio, so won't have this issue). Simplify the close logic by
abandoning the use of private and initializing the index to 1 and
dropping that reference when we previously set private.

Also, set sw_id only while sw_dev_mtx is held. This should only affect
swapping to a vnode, as opposed to a geom whose close always sets it to
NULL with sw_dev_mtx held.

Differential Review: https://reviews.freebsd.org/D3547
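
The reference handling described in 9e3e3fe5 can be sketched as below.
This is a single-threaded illustration, not the kernel code: struct
swap_consumer and the consumer_*() helpers are hypothetical, and the
real code would need locking or atomics. It shows the two points the
message makes: the index starts at 1 so that closing is just dropping
the initial reference, and a failed allocation must undo the increment
it already made.

```c
struct swap_consumer {
	int index;	/* in-flight I/Os plus one owner reference */
	int closed;	/* set when the last reference is dropped */
};

static void
consumer_init(struct swap_consumer *c)
{
	c->index = 1;	/* the owner's reference, dropped at close time */
	c->closed = 0;
}

/* Drop one reference; the last one closes the consumer. */
static void
consumer_release(struct swap_consumer *c)
{
	if (--c->index == 0)
		c->closed = 1;
}

/*
 * Start an I/O. alloc_ok == 0 simulates an out-of-memory error that
 * occurs after the index was already incremented.
 */
static int
consumer_start_io(struct swap_consumer *c, int alloc_ok)
{
	c->index++;
	if (!alloc_ok) {
		/* Without this, the count would never drain to zero. */
		consumer_release(c);
		return (-1);
	}
	return (0);
}
```

If the error path skipped consumer_release(), the index would stay
elevated forever, which is the swapoff failure the message describes.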


# 77923df2 21-Aug-2015 Alan Cox <alc@FreeBSD.org>

Eliminate pointless assignments to rtvals[] in swap_pager_putpages().

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# fade8dd7 23-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.

In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division


# 093ebe1d 17-Jun-2015 Gleb Smirnoff <glebius@FreeBSD.org>

o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thread's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# e735691b 08-May-2015 John Baldwin <jhb@FreeBSD.org>

Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero. This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision: https://reviews.freebsd.org/D2423
Reviewed by: alc, kib (earlier version)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc.


# 89c241d1 02-May-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Instead of reading, validating and adjusting value of the vm.swap_async_max
in the main swapper work cycle, do it in the sysctl handler. This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.

Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
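
The behaviour change in 89c241d1 (validate in the handler, reject bad
values instead of silently fixing them) can be sketched in plain C. The
function set_swap_async_max(), the default of 4 and the rule that the
value must be at least 1 are all assumptions for illustration; the
commit message does not state the valid range, and a real handler would
use the sysctl(9) machinery.

```c
#include <errno.h>

static int swap_async_max = 4;	/* hypothetical default */

/*
 * Setter in the style of a sysctl handler: an invalid value is refused
 * with an error and the stored value is left untouched.
 */
static int
set_swap_async_max(int newval)
{
	if (newval < 1)		/* assumed validity rule */
		return (EINVAL);
	swap_async_max = newval;
	return (0);
}
```

Because the stored value is always valid, the hot path can read it
without re-checking or taking a lock just to repair it.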


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 0ada3afc 09-Apr-2015 Alexander Motin <mav@FreeBSD.org>

Remove sleeps from geom_up thread on device destruction.

MFC after: 3 days.


# 3398491b 26-Mar-2015 Alexander Motin <mav@FreeBSD.org>

Make swapper release orphaned (lost) GEOM provider.

Swap device is still reported as enabled, and system still may crash later
if some swapped-out kernel pages were lost with the device, but at least
GEOM and CAM can now release the lost disk, allowing it to be reconnected.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# d9328101 23-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

\n at end of panicstr is redundant.

Submitted by: alc


# 90effb23 22-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:

o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# e065e87c 04-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix mis-spelling of bits and types names in the
default_pager_putpages() and swap_pager_putpages().
It is the same fix as was done for vnode_pager_putpages()
in r271586.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 61203277 25-Apr-2014 Dag-Erling Smørgrav <des@FreeBSD.org>

Add sysctl OIDs showing the actual size and capacity of the swap zone.

MFC after: 1 week


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff, the struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 0d8243cc 18-Mar-2014 Attilio Rao <attilio@FreeBSD.org>

vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock,
and threads can then sleep on the pip condition.
Avoid deadlocking such threads by correctly waking the sleeping ones
after the pip is finished.
The swapoff side of the bug can likely result in shutdown deadlocks.

Sponsored by: EMC / Isilon Storage Division
Reported by: pho, pluknet
Tested by: pho


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the 2 concepts into a real, minimal sxlock where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption about the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 5a3c920f 11-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

When swap pager allocates metadata in the pagedaemon context, allow it
to drain the reserve. This was broken in r243040, causing deadlock.
Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon
would only wait for the v_pageout_free_min anyway.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 56ce850b 09-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix typo in comment.

MFC after: 3 days


# 002f377a 06-Jun-2013 Attilio Rao <attilio@FreeBSD.org>

Complete r251452:
Avoid busying/unbusying a page in cases where there is no need to drop the
vm_obj lock, most notably when the page is fully valid after
vm_page_grab().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# c25673ff 28-May-2013 Attilio Rao <attilio@FreeBSD.org>

o Change the locking scheme for swp_bcount.
It can now be accessed with a write lock on the object containing it OR
with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and
swap_pager_unswapped() but keep the object locking references for
documentation.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 2cc718a1 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not map the swap i/o pbufs if the geom provider for the swap
partition accepts unmapped requests.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but a few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (requiring consumers
to include sys/mutex.h directly to cater for its inline functions
using VM_OBJECT_LOCK()) means that all the vm/vm_pager.h
consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks in
the compat layer because the name clash between the FreeBSD and Solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater for the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide


# 0dde287b 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 02c62349 19-Nov-2012 Jaakko Heinonen <jh@FreeBSD.org>

- Don't pass geom and provider names as format strings.
- Add __printflike() attributes.
- Remove an extra argument for the g_new_geomf() call in swapongeom_ev().

Reviewed by: pjd


# f379b823 04-Sep-2012 Dag-Erling Smørgrav <des@FreeBSD.org>

Whitespace cleanup.


# dc1b35b5 04-Sep-2012 Dag-Erling Smørgrav <des@FreeBSD.org>

No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.

(for real this time)


# 22a5e6b9 04-Sep-2012 Dag-Erling Smørgrav <des@FreeBSD.org>

Revert previous commit, which was performed in the wrong tree.


# db0390e8 04-Sep-2012 Dag-Erling Smørgrav <des@FreeBSD.org>

No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.


# 9462305c 27-Aug-2012 Sergey Kandaurov <pluknet@FreeBSD.org>

Typo in previous change: print half the theoretical maximum as maximum
recommended amount.

Reported by: <site freebsd at orientalsensation com>
Reviewed by: des


# 3ff863f1 16-Aug-2012 Dag-Erling Smørgrav <des@FreeBSD.org>

- When running out of swzone, instead of spewing an error message every
tick until the situation is resolved (if ever), just print a single
message when running out and another when space becomes available.

- When adding more swap, warn if the total amount exceeds half the
theoretical maximum we can handle.
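
The first item in 3ff863f1 amounts to logging on state transitions rather
than on every failed attempt. A minimal sketch follows; the function
note_zone_state(), the flag zone_exhausted and the message texts are
hypothetical and are not the strings the kernel prints.

```c
#include <stdbool.h>
#include <stdio.h>

/* Remembers whether the "exhausted" message has already been printed. */
static bool zone_exhausted;

/*
 * Report the outcome of an allocation attempt. A message is printed
 * only when the state changes; returns true if one was printed.
 */
static bool
note_zone_state(bool alloc_failed)
{
	if (alloc_failed && !zone_exhausted) {
		zone_exhausted = true;
		printf("swap zone exhausted\n");
		return (true);
	}
	if (!alloc_failed && zone_exhausted) {
		zone_exhausted = false;
		printf("swap zone ok\n");
		return (true);
	}
	return (false);
}
```

Repeated failures while already exhausted produce no output, so a
long-lived shortage yields one line instead of one per tick.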


# 6031c68d 16-Jun-2012 Alan Cox <alc@FreeBSD.org>

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


# 0a4a2b8e 01-Jun-2012 Eitan Adler <eadler@FreeBSD.org>

Revert r236380

PR: kern/166780
Requested by: many
Approved by: cperciva (implicit)


# 71ee98c9 31-May-2012 Eitan Adler <eadler@FreeBSD.org>

Add sysctl to query amount of swap space free

PR: kern/166780
Submitted by: Radim Kolar <hsn@sendmail.cz>
Approved by: cperciva
MFC after: 1 week


# 7870adb6 09-Feb-2012 Ed Schouten <ed@FreeBSD.org>

Remove direct access to si_name.

Code should just use the devtoname() function to obtain the name of a
character device. Also add const keywords to pieces of code that need it
to build properly.

MFC after: 2 weeks


# 8f12d83a 01-Feb-2012 Alexander Motin <mav@FreeBSD.org>

Fix NULL dereference panic on attempt to turn off (on system shutdown)
disconnected swap device.

This is a quick and imperfect solution, as the swap device will still be
open and GEOM will not be able to destroy it. The proper solution would be
to automatically turn off and close a disconnected swap device, but with the
existing code that will cause a panic if there is at least one page on the
device, even if it is an unimportant page of a user-level process. It needs
some work.

Reviewed by: kib@
MFC after: 1 week


# 134465d7 12-Dec-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix printf.

Submitted by: az
MFC after: 1 week


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# 15523cf7 22-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Update some comments in swap_pager.c.

Reviewed and most wording by: alc
MFC after: 1 week
Approved by: re (bz)


# 6e903bd0 22-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Apply the limit to avoid the overflows in the radix tree subr_blist.c
after the conversion of the swap device size to the page size units,
not before. That lifts the limit on the usable swap partition size
from 32GB to 256GB, that is less depressing for the modern systems.

Submitted by: Alexander V. Chernikov <melifaro ipfw ru>
Reviewed by: alc
Approved by: re (bz)
MFC after: 2 weeks
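
The jump from 32GB to 256GB in 6e903bd0 follows from applying the same
numeric cap to larger units. The sketch below works through the
arithmetic under stated assumptions: 512-byte device blocks, 4096-byte
pages, and a cap of 2^26 units. The cap is inferred here from the 32GB
figure and the block size, not taken from the source; MAX_UNITS,
DEV_BLOCK, PAGE_BYTES and max_swap_bytes() are hypothetical names.

```c
#include <stdint.h>

#define MAX_UNITS	(UINT64_C(1) << 26)	/* assumed cap, in units */
#define DEV_BLOCK	UINT64_C(512)		/* assumed device block size */
#define PAGE_BYTES	UINT64_C(4096)		/* assumed page size */
#define GIB		(UINT64_C(1) << 30)

/* Largest usable swap size when the cap is counted in `unit_size` bytes. */
static uint64_t
max_swap_bytes(uint64_t unit_size)
{
	return (MAX_UNITS * unit_size);
}
```

With these assumptions, max_swap_bytes(DEV_BLOCK) is 2^35 bytes (32GB)
and max_swap_bytes(PAGE_BYTES) is 2^38 bytes (256GB): an eightfold
increase, equal to the ratio of the page size to the block size.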


# dda4f960 01-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Implement the linprocfs swaps file, providing information about the
configured swap devices in the Linux-compatible format.

Based on the submission by: Robert Millan <rmh debian org>
PR: kern/159281
Reviewed by: bde
Approved by: re (kensmith)
MFC after: 2 weeks


# afcc55f3 06-Jul-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# cec9f109 26-Apr-2011 David E. O'Brien <obrien@FreeBSD.org>

Reap old SPL comments.

Reviewed by: alc


# 1ba5ad42 05-Apr-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add accounting for most of the memory-related resources.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 2860553a 01-Mar-2011 Rebecca Cran <brucec@FreeBSD.org>

Change the return type of vmspace_swap_count to a long to match the other
vmspace_*_count functions.

MFC after: 3 days


# 65d8409c 23-Feb-2011 Rebecca Cran <brucec@FreeBSD.org>

Calculate and return the count in vmspace_swap_count as a vm_offset_t
instead of an int to avoid overflow.

While here, clean up some style(9) issues.

PR: kern/152200
Reviewed by: kib
MFC after: 2 weeks


# 2c4992db 17-Jan-2011 Alan Cox <alc@FreeBSD.org>

Move the definition of M_VMPGDATA to the swap pager, where the only
remaining uses are.


# 4c18dec9 01-Jan-2011 Rebecca Cran <brucec@FreeBSD.org>

There can be more than 0x20000000 swap meta blocks allocated if a swap-backed
md(4) device is used. Don't panic when deallocating such a device if swap
has been used.

PR: kern/133170
Discussed with: kib
MFC after: 3 days


# ef694c1a 02-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.

Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation


# 55144670 19-Oct-2010 Andriy Gapon <avg@FreeBSD.org>

PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments

Reviewed by: alc
MFC after: 4 days


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 1f93868d 13-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC elimination of several settings of PG_REFERENCED bit, that either
do not make sense or are harmful.

MFC r206761 (by alc):
Setting PG_REFERENCED on the requested page in swap_pager_getpages() is
either redundant or harmful, depending on the caller.

MFC r206768 (by alc):
In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical.

MFC r206770 (by alc):
In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED.

MFC r206801 (by alc):
There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on.


# e6ca6764 13-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207364:
In swap pager, do not free the non-requested pages from the run if they are
wired. Kstack pages are wired, this change prepares swap pager for handling
of long runs of kstack pages.


# db1f085e 09-May-2010 Alan Cox <alc@FreeBSD.org>

Call vm_page_deactivate() rather than vm_page_dontneed() in
swp_pager_force_pagein(). By dirtying the page, swp_pager_force_pagein()
forces vm_page_dontneed() to insert the page at the head of the inactive
queue, just like vm_page_deactivate() does. Moreover, because the page
was invalid, it can't have been mapped, and thus the other effect of
vm_page_dontneed(), clearing the page's reference bits, has no effect. In
summary, there is no reason to call vm_page_dontneed() since its effect
will be identical to calling the simpler vm_page_deactivate().


# d061cdd5 08-May-2010 Alan Cox <alc@FreeBSD.org>

Remove the page queues lock around a call to vm_page_activate(). Make the
page dirty before adding it to the active queue.


# 3c4a2440 08-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# 97c38347 07-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary page queues locking.


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# e86a87e9 29-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

In swap pager, do not free the non-requested pages from the run if they are
wired. Kstack pages are wired, this change prepares swap pager for handling
of long runs of kstack pages.

Noted and reviewed by: alc
Tested by: pho
MFC after: 2 weeks


# 0b6ace47 17-Apr-2010 Alan Cox <alc@FreeBSD.org>

Setting PG_REFERENCED on the requested page in swap_pager_getpages() is
either redundant or harmful, depending on the caller. For example, when
called by vm_fault(), it is redundant. However, when called by
vm_thread_swapin(), it is harmful. Specifically, if the thread is later
swapped out, having PG_REFERENCED set on its stack pages leads the page
daemon to reactivate these stack pages and delay their reclamation.

Reviewed by: kib
MFC after: 3 weeks


# 8a9c731f 02-Nov-2009 Ivan Voras <ivoras@FreeBSD.org>

Add sysctl documentation strings. The descriptions are derived
from tuning(7). One of the descriptions references tuning(7) because
it is too complex to adequately describe here (it is not a simple
boolean sysctl) and users should be warned of that.

Reviewed by: alc, kib
Approved by: gnn (mentor)


# 46aaa1ed 21-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198201:
Remove spurious call to priv_check(PRIV_VM_SWAP_NOQUOTA).
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when per-uid limit is
actually exceeded.

Approved by: re (kensmith)


# 5c0e1c11 17-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Remove spurious call to priv_check(PRIV_VM_SWAP_NOQUOTA).
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when per-uid limit is
actually exceeded.

Both changes aim at calling priv_check(9) only for the cases when
privilege is actually exercised by the process.

Reported and tested by: rwatson
Reviewed by: alc
MFC after: 3 days
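
The ordering described above, consulting the privilege only once the limit is actually exceeded, can be sketched as follows. All demo_ names are hypothetical stand-ins, and the counter exists only to make the number of privilege checks observable; this is not the kernel's priv_check(9) interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

static int demo_priv_checks;	/* how many times the privilege was consulted */

/* Hypothetical stand-in for a privilege check: 0 on success, EPERM otherwise. */
static int
demo_priv_check(bool privileged)
{
	demo_priv_checks++;
	return (privileged ? 0 : EPERM);
}

/* Decide whether incr more bytes may be reserved on top of used. */
static int
demo_reserve(uint64_t used, uint64_t limit, uint64_t incr, bool privileged)
{
	assert(used <= limit);
	if (incr <= limit - used)
		return (0);	/* within the limit: no privilege exercised */
	if (demo_priv_check(privileged) == 0)
		return (0);	/* over the limit, but allowed to exceed it */
	return (ENOMEM);
}
```

A request that fits under the limit returns before the check is reached, so the privilege is recorded as used only by callers that genuinely needed it.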


# 7a8af8ee 24-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Initialize the uip to silence gcc warning that seems to sneak in in some
build environments.

Reported by: alc, bf1783 at googlemail com


# 3364c323 23-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
which is used to charge the appropriate uid when allocation is performed by
the kernel, e.g. md(4).

Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
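
A minimal sketch of the per-uid charge described above, assuming a single byte counter per uid and a limit that stands in for RLIMIT_SWAP. The structure and function names are hypothetical; the real accounting also tracks a global total and moves the charge between map entries and objects.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct demo_uidinfo {
	uint64_t reserved;	/* bytes currently charged to this uid */
	uint64_t limit;		/* stand-in for the RLIMIT_SWAP value */
};

/* Charge incr bytes to the uid, or fail with ENOMEM and change nothing. */
static int
demo_swap_reserve(struct demo_uidinfo *uip, uint64_t incr)
{
	assert(uip->reserved <= uip->limit);
	if (incr > uip->limit - uip->reserved)
		return (ENOMEM);
	uip->reserved += incr;
	return (0);
}

/* Give back an earlier charge, e.g. when the region is unmapped. */
static void
demo_swap_release(struct demo_uidinfo *uip, uint64_t decr)
{
	assert(decr <= uip->reserved);
	uip->reserved -= decr;
}
```

A failed reservation leaves the counter untouched, which is why a caller such as fork(2) can simply propagate ENOMEM.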


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 8eb5a1cd 28-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix typo.


# a80982c1 26-Apr-2009 Alan Cox <alc@FreeBSD.org>

Eliminate an errant comment.

Discussed with: tegge


# 016a3c93 24-Apr-2009 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access. Instead, assert that the page is not
mapped or not mapped for write access as appropriate.

Eliminate unnecessary clearing of a page's dirty mask. Instead, assert
that the page's dirty mask is clear.


# 9d13a605 20-Feb-2009 Alan Cox <alc@FreeBSD.org>

Eliminate stale comments.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 2025d69b 29-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Move the out-of-memory handling code from vm_pageout_scan()
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.

Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone
is exhausted.

Reviewed by: alc
Tested by: pho, jhb
MFC after: 2 weeks


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and entirely useless.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 3677ad36 30-Jul-2008 John Baldwin <jhb@FreeBSD.org>

If the kernel has run out of metadata for swap, then explicitly panic()
instead of emitting a warning before deadlocking.

MFC after: 1 month


# 11041003 11-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

Use VM_ALLOC_INTERRUPT for the page requests when allocating memory
for the bio for a swapout write. It allows the page allocator to drain
the free page list deeper. As a result, a deadlock where the pageout daemon
sleeps waiting for a bio to be allocated for swapout is no longer
reproducible in practice.

Alan said that M_USE_RESERVE should be resurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.

Tested by: pho, kris
Reviewed by: alc
No objections from: phk
MFC after: 2 weeks (RELENG_7 only)


# c8c7ad92 05-May-2008 Kip Macy <kmacy@FreeBSD.org>

add malloc flag to blist so that it can be used in ithread context

Reviewed by: alc, bsdimp


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument that is always curthread.
Remove the useless extra argument and pass curthread explicitly to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will follow in further commits.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI, obviously, changes as a result.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, it is worth saying that the next commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 35918c55 08-Jan-2008 Christian S.J. Peron <csjp@FreeBSD.org>

When MAC is enabled in the kernel, fix a panic triggered by a locking
assertion hit in swapoff_one() when we un-mount a swap partition. We
should be using curthread where we used thread0 before. This change
also replaces the thread argument with a credential argument, as the
MAC framework only requires the cred.

It should be noted that this allows the machine to be rebooted without
panicking with "cannot differ from curthread or NULL" when MAC is enabled.

Submitted by: rwatson
Reviewed by: attilio
MFC after: 2 weeks


# 7036145b 02-Nov-2007 Maxim Konovalov <maxim@FreeBSD.org>

o Fix panic message: it's swap_pager_putpages() not swap_pager_getpages().

Submitted by: Mark Tinguely


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# b5e8f167 05-Aug-2007 Alan Cox <alc@FreeBSD.org>

Consider a scenario in which one processor, call it Pt, is performing
vm_object_terminate() on a device-backed object at the same time that
another processor, call it Pa, is performing dev_pager_alloc() on the
same device. The problem is that vm_pager_object_lookup() should not be
allowed to return a doomed object, i.e., an object with OBJ_DEAD set,
but it does. In detail, the unfortunate sequence of events is: Pt in
vm_object_terminate() holds the doomed object's lock and sets OBJ_DEAD
on the object. Pa in dev_pager_alloc() holds dev_pager_sx and calls
vm_pager_object_lookup(), which returns the doomed object. Next, Pa
calls vm_object_reference(), which requires the doomed object's lock, so
Pa waits for Pt to release the doomed object's lock. Pt proceeds to the
point in vm_object_terminate() where it releases the doomed object's
lock. Pa is now able to complete vm_object_reference() because it can
now complete the acquisition of the doomed object's lock. So, now the
doomed object has a reference count of one! Pa releases dev_pager_sx
and returns the doomed object from dev_pager_alloc(). Pt now acquires
dev_pager_mtx, removes the doomed object from dev_pager_object_list,
releases dev_pager_mtx, and finally calls uma_zfree with the doomed
object. However, the doomed object is still in use by Pa.

Repeating my key point, vm_pager_object_lookup() must not return a
doomed object. Moreover, the test for the object's state, i.e.,
doomed or not, and the increment of the object's reference count
should be carried out atomically.

Reviewed by: kib
Approved by: re (kensmith)
MFC after: 3 weeks
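
The key point above, that the doomed test and the reference increment must happen atomically, can be sketched in userland with a pthread mutex. The demo_ names are hypothetical, and the dead field only stands in for OBJ_DEAD; this is not the kernel's vm_pager_object_lookup().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_object {
	pthread_mutex_t lock;
	bool dead;		/* stand-in for OBJ_DEAD */
	int ref_count;
};

/*
 * Return the object with a new reference, or NULL if it is doomed.
 * The test and the increment share one lock hold, so a terminating
 * thread cannot slip in between them.
 */
static struct demo_object *
demo_lookup_ref(struct demo_object *obj)
{
	struct demo_object *ret = NULL;

	pthread_mutex_lock(&obj->lock);
	if (!obj->dead) {
		obj->ref_count++;
		ret = obj;
	}
	pthread_mutex_unlock(&obj->lock);
	return (ret);
}

/* Mark the object doomed; later lookups must not hand it out. */
static void
demo_terminate(struct demo_object *obj)
{
	pthread_mutex_lock(&obj->lock);
	assert(!obj->dead);
	obj->dead = true;
	pthread_mutex_unlock(&obj->lock);
}
```

Once demo_terminate() has run, every later lookup returns NULL and the reference count can no longer rise, which is the property the scenario with Pt and Pa lacked.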


# fe8606ac 24-Jun-2007 Alan Cox <alc@FreeBSD.org>

Eliminate GIANT_REQUIRED from swap_pager_putpages().

Approved by: re (mux)
MFC after: 1 week


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distributed loads method for vmmeter (distributed through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 9e223287 31-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# d9135e72 23-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Audit pathnames looked up in swapon(2) and swapoff(2).

MFC after: 2 weeks
Obtained from: TrustedBSD Project


# 4d70511a 27-Feb-2007 John Baldwin <jhb@FreeBSD.org>

Use pause() rather than tsleep() on stack variables and function pointers.


# e8865caf 07-Feb-2007 John Baldwin <jhb@FreeBSD.org>

- Move 'struct swdevt' back into swap_pager.h and expose it to userland.
- Restore support for fetching swap information from crash dumps via
kvm_get_swapinfo(3) to fix pstat -T/-s on crash dumps.

Reviewed by: arch@, phk
MFC after: 1 week


# 663b416f 05-Jan-2007 John Baldwin <jhb@FreeBSD.org>

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.

MFC after: 1 week
Reviewed by: alc


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 66bdd5d6 22-Oct-2006 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_wakeup().


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# ab1661cb 05-Aug-2006 Alan Cox <alc@FreeBSD.org>

Remove a stale comment.


# 91449ce9 03-Aug-2006 Alan Cox <alc@FreeBSD.org>

When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.


# 61f73c79 10-May-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Use better order here.


# 0909f38a 10-Apr-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

On shutdown try to turn off all swap devices. This way GEOM providers are
properly closed on shutdown.

Requested by: ru
Reviewed by: alc
MFC after: 2 weeks


# 62a59e8f 07-Mar-2006 Warner Losh <imp@FreeBSD.org>

Remove leading __ from __(inline|const|signed|volatile). They are
obsolete. This should reduce diffs to NetBSD as well.


# 100650de 27-Jan-2006 Olivier Houchard <cognet@FreeBSD.org>

Make sure b_vp and b_bufobj are NULL before calling relpbuf(), as it asserts
they are. They should be NULL at this point, except if we're coming from
swapdev_strategy().
It should only affect the case where we're swapping directly on a file over
NFS.


# 3cfc7651 21-Sep-2005 Olivier Houchard <cognet@FreeBSD.org>

Make sure we have a bufobj before calling bstrategy().
I'm not sure this is the right thing to do, but at least I don't panic
anymore when swapping on a NFS file without using md(4).

X-MFC after: proper review


# ec9c9e73 20-Jul-2005 Alan Cox <alc@FreeBSD.org>

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


# 071a1710 20-May-2005 Alan Cox <alc@FreeBSD.org>

Reduce the number of times that we acquire and release locks in
swap_pager_getpages().

MFC after: 1 week


# 1aececdb 19-May-2005 Alan Cox <alc@FreeBSD.org>

Remove calls to spl*().


# 2e2a6fa2 18-May-2005 Alan Cox <alc@FreeBSD.org>

Revert revision 1.270: swp_pager_async_iodone() need not perform
VM_LOCK_GIANT().

Discussed with: jeff


# 382a601c 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- VM_LOCK_GIANT in the swap pager's iodone routine as VFS will soon call it
without Giant.

Sponsored by: Isilon Systems, Inc.


# 7625cbf3 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Pass the ISOPEN flag to namei so filesystems will know we're about to
open them or otherwise access the data.


# 010b1ca1 18-Mar-2005 David Schultz <das@FreeBSD.org>

Move the swap_zone == NULL check earlier (i.e. before we dereference
the pointer.)

Found by: Coverity Prevent analysis tool


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 4f8205e5 03-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

When allocating bio's in the swap_pager use M_WAITOK since the
alternative is much worse.


# 9799b417 19-Nov-2004 David Schultz <das@FreeBSD.org>

Disable U area swapping and remove the routines that create, destroy,
copy, and swap U areas.

Reviewed by: arch@


# 8bc61209 06-Nov-2004 David Schultz <das@FreeBSD.org>

Fix the last known race in swapoff(), which could lead to a spurious panic:

swapoff: failed to locate %d swap blocks

The race occurred because putpages() can block between the time it
allocates swap space and the time it updates the swap metadata to
associate that space with a vm_object, so swapoff() would complain
about the temporary inconsistency. I hoped to fix this by making
swp_pager_getswapspace() and swp_pager_meta_build() a single atomic
operation, but that proved to be inconvenient. With this change,
swapoff() simply doesn't attempt to be so clever about detecting when
all the pageout activity to the target device should have drained.


# b3fed13e 04-Nov-2004 David Schultz <das@FreeBSD.org>

Close a race in swapoff(). Here are the gory details:

In order to avoid livelock, swapoff() skips over objects with a
nonzero pip count and makes another pass if necessary. Since it is
impossible to know which objects we care about, it would choose an
arbitrary object with a nonzero pip count and wait for it before
making another pass, the theory being that this object would finish
paging about as quickly as the ones we care about. Unfortunately,
we may have slept since we acquired a reference to this object.
Hack around this problem by tsleep()ing on the pointer anyway, but
timeout after a fixed interval. More elegant solutions are possible,
but the ones I considered unnecessarily complicate this rare case.

Also, kill some nits that seem to have crept into the swapoff() code
in the last 75 revisions or so:

- Don't pass both sp and sp->sw_used to swap_pager_swapoff(), since
the latter can be derived from the former.

- Replace swp_pager_find_dev() with something simpler. There's no
need to iterate over the entire list of swap devices just to determine
if a given block is assigned to the one we're interested in.

- Expand the scope of the swhash_mtx in a couple of places so that it
isn't released and reacquired once for every hash bucket.

- Don't drop the swhash_mtx while holding a reference to an object.
We need to lock the object first. Unfortunately, doing so would
violate the established lock order, so use VM_OBJECT_TRYLOCK() and
try again on a subsequent pass if the object is already locked.

- Refactor swp_pager_force_pagein() and swap_pager_swapoff() a bit.
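
The try-lock approach mentioned in the list above can be sketched with pthread_mutex_trylock(). The function and parameter names are hypothetical; the caller is assumed to already hold another lock that normally comes after the object lock in the lock order, which is why blocking is not an option.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Try to process one object without blocking on its lock.
 * Returns 1 if the object was visited, 0 if its lock was busy and the
 * caller should come back to it on a subsequent pass.
 */
static int
demo_visit(pthread_mutex_t *object_lock, int *visited)
{
	int error;

	error = pthread_mutex_trylock(object_lock);
	if (error != 0)
		return (0);	/* busy: skip now, retry on the next pass */
	(*visited)++;		/* the per-object work would go here */
	error = pthread_mutex_unlock(object_lock);
	assert(error == 0);
	(void)error;
	return (1);
}
```

Skipping a busy object trades a little extra scanning for freedom from deadlock, since the thread never sleeps while holding locks in the wrong order.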


# c5d3d25e 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

De-couple our I/O bio request from the embedded bio in buf by explicitly
copying the fields.


# c5690651 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove buf->b_dev field.


# b792bebe 24-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which call through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# 494eb176 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# a76d8f4e 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the VI_BWAIT flag into a new bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use of interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# f6bcadc4 24-Sep-2004 David Schultz <das@FreeBSD.org>

Don't look for swap blocks in objects that aren't swap-backed.
I expect that this will fix the following panic, reported by Jun:
swap_pager_isswapped: failed to locate all swap meta blocks

MT5 candidate


# 5721c9c7 08-Aug-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Tag all geom classes in the tree with a version number.


# 5285558a 22-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 9bd86a98 05-Jul-2004 Bruce M Simpson <bms@FreeBSD.org>

Properly brucify a string by outdenting it.


# 0e3fe6e3 23-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

In swap_pager_getpages(), bp->b_dev can be NULL, particularly for the
case of NFS mounted swap, so do not try to dereference it.

While we're here, brucify the printf() call which happens when we
time out on acquisition of vm_page_queue_mtx.

PR: kern/67898
Submitted by: bde (style)


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 5a324893 05-May-2004 Alan Cox <alc@FreeBSD.org>

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


# 2c840b1f 22-Feb-2004 Alan Cox <alc@FreeBSD.org>

- Substitute bdone() and bwait() from vfs_bio.c for
swap_pager_putpages()'s buffer completion code. Note: the only
difference between swp_pager_sync_iodone() and bdone(), aside from
the locking in the latter, was the unnecessary clearing of B_ASYNC.
- Remove an unnecessary pmap_page_protect() from
swp_pager_async_iodone().

Reviewed by: tegge


# d2bae332 12-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the absolute count g_access_abs() function since experience has
shown that it is not useful.

Rename the relative count g_access_rel() function to g_access(), only
the name has changed.

Change all g_access_rel() calls in our CVS tree to call g_access() instead.

Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source
code compatibility.


# c5aebf38 07-Feb-2004 Alan Cox <alc@FreeBSD.org>

swp_pager_async_iodone() no longer requires Giant. Modify bufdone()
and swapgeom_done() to perform swp_pager_async_iodone() without Giant.

Reviewed by: tegge


# 3e5b6861 02-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Check error return from g_clone_bio(). (netchild@)

Add XXX comment about why this is still not optimal. (phk@)

Submitted by: netchild@


# 7dea2c2e 24-Jan-2004 Alan Cox <alc@FreeBSD.org>

1. Statically initialize swap_pager_full and swap_pager_almost_full to the
full state. (When swap is added their state will change appropriately.)
2. Set swap_pager_full and swap_pager_almost_full to the full state when
the last swap device is removed.
Combined, these changes eliminate nonsense messages from the kernel on
swap-less machines.

Item 2 submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
Prodding by: phk
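
The two items above amount to a small state machine: start in the full state, leave it when swap is added, and return to it when the last device goes away. The sketch below uses hypothetical demo_ names and ignores the size check that the real code applies when a device is added.

```c
#include <assert.h>
#include <stdbool.h>

/* Start "full": with no swap configured there is no free swap space. */
static bool demo_swap_full = true;
static int demo_nswapdev = 0;

/* Adding a device makes space available (size checks omitted here). */
static void
demo_swapon(void)
{
	demo_nswapdev++;
	demo_swap_full = false;
}

/* Removing the last device puts the pager back in the full state. */
static void
demo_swapoff(void)
{
	assert(demo_nswapdev > 0);
	if (--demo_nswapdev == 0)
		demo_swap_full = true;
}
```

With this initial state, a machine that never configures swap never sees a transition into the full state, so nothing is logged about running out of swap.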


# 2f7af3db 04-Jan-2004 Alan Cox <alc@FreeBSD.org>

Simplify the various pager allocation routines by computing the desired
object size once and assigning that value to a local variable.


# e793b779 03-Jan-2004 Alan Cox <alc@FreeBSD.org>

Reduce the scope of Giant in swap_pager_alloc().


# bd228075 28-Dec-2003 Alan Cox <alc@FreeBSD.org>

Remove swap_pager_un_object_list; it is unused.


# c7c8dd7e 01-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Modify swap_pager_copy() and its callers such that the source and
destination objects are locked on entry and exit. Add comments to
the callers noting that the locks can be released by swap_pager_copy().
- Remove several instances of GIANT_REQUIRED.


# 2928cef7 30-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to the swdevt's sw_flags with sw_dev_mtx.
- Remove several instances of GIANT_REQUIRED.


# 7645e885 30-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to the swdevt's sw_blist with sw_dev_mtx.
- Remove several instances of GIANT_REQUIRED.


# d05bc129 30-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to swdevhd using sw_dev_mtx.
- Use swp_sizecheck() rather than assignment to swap_pager_full in
swaponsomething().


# 0676a140 29-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize updates to nswapdev using sw_dev_mtx.


# 2d9974c1 28-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Avoid a race in swaponsomething(): Calculate the new swdevt's first and
end swblk and insert this new swdevt into the list of swap devices
in the same critical section.


# d536c58f 26-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Complete the synchronization of accesses to the swblock hash table.


# 7827d9b0 26-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Introduce and use a mutex synchronizing access to the swblock hash table.


# ee3dc7d7 25-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Add some of the required vm object locking, including assertions where
the vm object lock is required and already held.


# 2e3b314d 24-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing
vm_pageout_page_stats() from Giant.
- Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the
vm object to be locked on entry. (All of the pager routines now expect
this.)


# 2c18019f 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


# 9fbf91c0 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY().
Remove stale comment about B_PHYS.


# afeb65e6 01-Sep-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Don't open with exclusive bit, swapon(8) wants to trash our swapdev.

Add XXX comment with a rating of this concept.


# dee34ca4 30-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a close() method to a swapdev.

Add a GEOM based backend.

Remove the device/VOP_SPECSTRATEGY() based backend.


# 20da9c2e 30-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Protect the swapdevice tailq with a mutex.

Store the udev_t we will report to userland in the swdevt.


# 59efee01 30-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Continue the objectification of the swapdev backends:

Remove the vnode and dev_t fields and replace them with a void *.

Introduce separate strategy functions for devices and regular (NFS)
vnodes.

For devices we don't need the vnode v_numoutput stuff.

Add a generic swaponsomething() function to add a swapdevice and
split the remainder of swaponvp() into swaponvp() and swapondev()
which calls this backend.


# 4b03903a 30-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Make the strategy function a method of the individual swapdev.


# 2f249180 30-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Consistently use modern function definitions


# 395714fe 15-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unnecessary udev_t variable: we can derive it from the dev_t
when we need it.


# 89dc784f 14-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Make swaponvp() static to the swap_pager.


# ef3c5abd 06-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Make the first two pages magic to protect the BSD labels rather than
only one.


# 751221fd 05-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize swap_pager_putpages()

Eliminate a lot of checks to make sure requests are not cross-device
which is unnecessary with the new layout. We know a sequential request
cannot possibly be cross-device because there is a reserved page between
the devices.

Remove a couple of comments which no longer are relevant.


# 5e04322a 06-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Explicitly set B_PAGING


# c37a77ee 06-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Rip out the totally bogus vnode swapdev_vp with extreme prejudice.

Don't mark buffers with B_KEEPGIANT, we don't drop giant in strategy
at this point in time.


# e04e4bac 05-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Use sparse struct initialization for struct pagerops.

Mark our buffers B_KEEPGIANT before sending them downstream.

Remove swap_pager_strategy implementation.


# 665c0caf 04-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Put an uncovered page between the swap devices, that way we can be sure
to not get any cross-device I/O requests. (The unallocated first page
protecting BSD labels already gave us this, but that hack may go away
at some point in time).

Remove the check for cross-device I/O requests in swap_pager_strategy.

Move the repeated statistics updating into flushchainbuf().


# 12692209 03-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Name swap_pager_find_dev() more correctly swp_pager_find_dev().

Use ->bio_children to count child buffers, rather than abuse the
bio_caller1 pointer.

Expand the relevant bits of waitchainbuf() inline, this clarifies
the code a little bit.


# 5ff0108d 03-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

I accidentally hit undo before committing, fix the resulting off-by-one.


# 8f60c087 03-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Change the layout policy of the swap_pager from a hardcoded width
striping to a per device round-robin algorithm.

Because of the policy of not attempting to retain previous swap
allocation on page-out, this means that a newly added swap device
almost instantly takes its 1/N share of the I/O load but it takes
somewhat longer for it to assume its 1/N share of the pages if there
is plenty of space on the other devices.

Change the 8G total swapspace limitation to 8G per device instead
by using a per device blist rather than one global blist. This
reduces the memory footprint by 75% (typically a couple hundred
kilobytes) for the common case with one swapdevice but NSWAPDEV=4.

Remove the compile time constant limit of number of swap devices,
there is no limit now. Instead of a fixed size array, store the
per swapdev structure in a TAILQ.

Total swap space is still addressed by a 32 bit page number and
therefore the upper limit is now 2^44 bytes = 16TB (for i386).

We still do not allocate the first page of each device in order to
give some amount of protection to any bsdlabel at the start of the
device.

A new device is appended after the existing devices in the swap space,
no attempt is made to fill in holes left behind by swapoff (this can
trivially be changed should it ever become a problem).

The sysctl vm.nswapdev now reflects the number of currently configured
swap devices.

Rename vm_swap_size to swap_pager_avail for consistency with other
exported names.

Change argument type for vm_proc_swapin_all() and swap_pager_isswapped()
to be a struct swdevt pointer rather than an index.

Not changed: we are still using blists to manage the free space,
but since the swapspace is no longer fragmented by the striping
different resource managers might fare better.
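
The per-device round-robin policy described above can be sketched roughly as
follows. This is an illustration only, not the kernel's code: struct swdev,
swap_alloc_page() and the simple "used" counter are hypothetical stand-ins
(the commit keeps a blist per device and stores the devices in a TAILQ).

```c
#include <stddef.h>

/* Hypothetical stand-in for a swap device; the real pager uses a blist. */
struct swdev {
	unsigned npages;	/* capacity in pages */
	unsigned used;		/* pages handed out so far */
};

/*
 * Try each device once, starting at *cursor, and take one page from the
 * first device that has room.  Returns the index of the chosen device, or
 * -1 if every device is full.  On success *cursor moves past the chosen
 * device, so consecutive allocations rotate across the devices.
 */
static int
swap_alloc_page(struct swdev *devs, size_t ndev, size_t *cursor)
{
	for (size_t i = 0; i < ndev; i++) {
		size_t idx = (*cursor + i) % ndev;

		if (devs[idx].used < devs[idx].npages) {
			devs[idx].used++;
			*cursor = (idx + 1) % ndev;
			return ((int)idx);
		}
	}
	return (-1);
}
```

Because the cursor advances on every successful allocation, a newly added
device starts receiving its share of page-outs right away, while its share of
resident swap pages catches up only as older allocations are released.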


# 8d677ef9 31-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused stuff.

Move used stuff to swap_pager.c where it belongs.

This file no longer exports anything to userland.


# a8d43c90 26-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful, in particular to fdescfs and /dev/fd/*

For now pass -1 all over the place.


# a5edd34a 22-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove all but one of the inlines here, this reduces the code size by
2032 bytes and has no measurable impact on performance.


# da5fd145 22-Jul-2003 Peter Wemm <peter@FreeBSD.org>

swp_pager_hash() was called before it was instantiated inline. This made
gcc (quite rightly) unhappy. Move it earlier.


# 85fdafb9 18-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a printf format warning I introduced.
Use the macro max number of swap devices rather than cache the constant
in a variable.
Avoid a (now) pointless variable.


# d3dd89ab 18-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

If a proposed swap device exceeds the 8G artificial limit which our
radix-tree code imposes, truncate the device instead of rejecting it.


# ec38b344 18-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Move the implementation of the vmspace_swap_count() (used only in
the "toss the largest process" emergency handling) from vm_map.c to
swap_pager.c.

The quantity calculated depends strongly on the internals of the
swap_pager and by moving it, we no longer need to expose the
internal metrics of the swap_pager to the world.


# 567104a1 18-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a new function swap_pager_status() which reports the total size of the
paging space and how much of it is in use (in pages).

Use this interface from the Linuxolator instead of groping around in the
internals of the swap_pager.


# e9c0cc15 18-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Merge swap_pager.c and vm_swap.c into swap_pager.c, the separation
is not natural and needlessly exposes a lot of dirty laundry.

Move private interfaces between the two from swap_pager.h to swap_pager.c
and staticize as much as possible.

No functional change.


# 116b3c2a 17-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Make sure that SWP_NPAGES always has the same value in all source
files, so that SWAP_META_PAGES does not vary either.

swap_pager.c ended up with a value of 16, everybody else 8. Go with
the 16 for now.

This should only have any effect in the "kill processes because we
are out of swap" scenario, where it will make some sort of estimate
of something more precise.


# dd5e55f8 24-Jun-2003 Alan Cox <alc@FreeBSD.org>

Maintain the lock on a vm object when calling vm_page_grab().


# 5ea4972c 20-Jun-2003 Alan Cox <alc@FreeBSD.org>

Make swap_pager_haspages() static; remove unused function prototypes.


# b94b853b 16-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

This file was ignored by CVS in my last commit for some reason:

Remove pointless initialization of b_spc field, which now no longer
exists.


# 33a609ec 13-Jun-2003 Alan Cox <alc@FreeBSD.org>

Extend the scope of the vm object lock in swp_pager_async_iodone() to cover
a vm_page_free().


# 8630c117 12-Jun-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to various pagers' "get pages" methods, i386 stack
management functions, and a u area management function.


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 19ba4c8e 07-Jun-2003 Alan Cox <alc@FreeBSD.org>

Assert that the vm object is locked on entry to swap_pager_freespace().


# 658ad5ff 05-May-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object when performing vm_pager_deallocate().


# 17cd3642 28-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing swap_pager_isswapped().
- Assert that the vm_object is locked in swap_pager_isswapped().


# 1ca58953 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# d68d828b 20-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_add().
- Remove an unnecessary variable.


# d22bc710 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_add().


# 0fa05eae 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().


# 0d420ad3 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# c410df59 03-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Avoid extern decls in .c files by putting them in the vm/swap_pager.h
include file where they belong.
Share the dmmax_mask variable.


# 86270230 02-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


# b365ea9e 17-Dec-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_flag_set().


# 92da00bb 15-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

This is David Schultz's swapoff code which I am finally able to commit.
This should be considered highly experimental for the moment.

Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 3 weeks


# a12cc0e4 17-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# f64e99ba 11-Nov-2002 Olivier Houchard <cognet@FreeBSD.org>

Remove extra #include<sys/vmmeter.h>.


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 6a2eac8a 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Lock access to numoutput on the swap devices.


# ec61f55d 31-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Reduce the maximum KVA reserved for swap meta structures from 70 to 32 MB.
Reduce the swap meta calculation by a factor of 2, it's still massive overkill.

X-MFC after: immediately


# ab9abe5d 21-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_free().


# 40eab1e9 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_try_to_cache(). (The accesses
in kern/vfs_bio.c are already locked.)
o Assert that the page queues lock is held in vm_page_try_to_cache().


# 23f09d50 26-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Avoid using the 64-bit vm_pindex_t in a few places where 64-bit
types are not required, as the overhead is unnecessary:

o In the i386 pmap_protect(), `sindex' and `eindex' represent page
indices within the 32-bit virtual address space.
o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary
variable to store the low few bits of a vm_pindex_t that gets used
as an array index.
o vm_uiomove() uses `osize' and `idx' for page offsets within a
map entry.
o In vm_object_split(), `idx' is a page offset within a map entry.


# 5125fe4f 26-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Use an explicit cast to avoid relying on sign extension to do the
right thing in code such as `vm_pindex_t x = ~SWAP_META_MASK'.

Reviewed by: dillon
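
A small sketch of why the explicit cast matters, assuming a 32-bit int and a
64-bit index type. The macro and function names below are hypothetical; only
the pattern `~MASK` assigned to a 64-bit value comes from the commit message.

```c
#include <stdint.h>

#define META_MASK_SIGNED	15	/* int: ~ yields a negative int */
#define META_MASK_UNSIGNED	15u	/* unsigned int: ~ yields 0xfffffff0 */

/* Correct only because the negative int is sign-extended to 64 bits. */
static uint64_t
mask_implicit_signed(void)
{
	return (~META_MASK_SIGNED);
}

/* An unsigned mask is zero-extended instead, losing the high 32 bits. */
static uint64_t
mask_implicit_unsigned(void)
{
	return (~META_MASK_UNSIGNED);
}

/* Widening before inverting is correct whatever the mask's type is. */
static uint64_t
mask_explicit(void)
{
	return (~(uint64_t)META_MASK_UNSIGNED);
}
```

The first form works only as long as the mask stays a signed int; the cast in
the third form removes that hidden dependency.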


# 24c46d03 22-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Replace GIANT_REQUIRED in swap_pager_alloc() by the acquisition and
release of Giant. (Annotate as MPSAFE.)


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 670d17b5 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.


# 11caded3 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# fdcc1cc0 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.


# 57c10583 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.


# d6844b6b 15-Oct-2001 Tor Egge <tegge@FreeBSD.org>

Don't use an uninitialized field reserved for callers in the bio structure
passed to swap_pager_strategy(). Instead, use a field reserved for drivers
and initialize it before usage.

Reviewed by: dillon


# bd78cece 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 2f9e4e80 19-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Limit the amount of KVM reserved for the buffer cache and for swap-meta
information. The default limits only affect machines with > 1GB of ram
and can be overridden with two new kernel conf variables VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and
kern.maxbcache. This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid tripups where a sysad
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.

Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating. The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.


# 61ce6eee 02-Aug-2001 Alfred Perlstein <alfred@FreeBSD.org>

Fixups for the initial allocation by dillon:
1) allocate fewer buckets
2) when failing to allocate swap zone, keep reducing the zone by
a third rather than a half in order to reduce the chance of
allocating way too little.

I also moved around some code for readability.

Suggested by: dillon
Reviewed by: dillon
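
The retry policy in item 2 can be illustrated with a minimal sketch.
shrink_to_fit() and its "limit" parameter are hypothetical; in the kernel the
test is whether the zone allocation succeeds, not a comparison against a known
limit, and the exact rounding used there is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Shrink a requested entry count n until it no longer exceeds limit,
 * removing 1/div of the current value (rounded up) on each retry.
 * div == 2 models halving; div == 3 models reducing by a third.
 */
static size_t
shrink_to_fit(size_t n, size_t limit, size_t div)
{
	assert(div > 0);
	while (n > limit)
		n -= (n + div - 1) / div;	/* rounds up, so n always shrinks */
	return (n);
}
```

For a request of 100 entries against a limit of 70, cutting by a third settles
on 66 entries where halving would drop to 50, which is the "way too little"
outcome the smaller step is meant to make less likely.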


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 6d541bf1 22-Jun-2001 John Baldwin <jhb@FreeBSD.org>

- Protect all accesses to nsw_[rw]count{,_{,a}sync} with the pbuf mutex.
- Don't drop the vm mutex while grabbing the pbuf mutex to manipulate
said variables.


# b608320d 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Fix the sw_alloc_interlock to actually lock itself when the lock is
acquired.
- Assert Giant is held in the strategy, getpages, and putpages methods and
the getchainbuf, flushchainbuf, and waitchainbuf functions.
- Always call flushchainbuf() w/o the VM lock.


# c5e62505 23-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

acquire Giant when playing with the buffercache and doing IO.
use msleep against the vm mutex while waiting for a page IO to complete.


# 240e0fdd 22-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

acquire vm mutex in swp_pager_async_iodone. Don't call swp_pager_async_iodone
with the mutex held.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# a468031c 06-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Actually biofinish(struct bio *, struct devstat *, int error) is more general
than the bioerror().

Most of this patch is generated by scripts.


# a9fa2c05 18-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

Protect pager object creation with sx locks.

Protect pager object list manipulation with a mutex.

It doesn't look possible to combine them under a single sx lock because
creation may block and we can't have the object list manipulation block
on anything other than a mutex because of interrupt requests.


# 2a758ebe 13-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

protect pbufs and associated counts with a mutex


# edfa785a 23-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

Introduce per-swap area accounting in the VM system, and export
this information via the vm.nswapdev sysctl (number of swap areas)
and vm.swapdevX nodes (where X is the device), which contain the MIBs
dev, blocks, used, and flags. These changes are required to allow
top and other userland swap-monitoring utilities to run without
setgid kmem.

Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: freebsd-audit


# 21cd6e62 13-Dec-2000 Seigo Tanimura <tanimura@FreeBSD.org>

- If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.

This is just a temporary fix. Better solutions include:
(suggested by: dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by: alfred, dillon


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# cee313c4 19-Nov-2000 Robert Watson <rwatson@FreeBSD.org>

o Export dmmax ("Maximum size of a swap block") using SYSCTL_INT.
This removes a reason that systat requires setgid kmem. More to
come.


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 64bcb9c8 13-Oct-2000 Matthew Dillon <dillon@FreeBSD.org>

The swap bitmap allocator was not calculating the bitmap size properly
in the face of non-stripe-aligned swap areas. The bug could cause a
panic during boot.

Refuse to configure a swap area that is too large (67 GB or so)

Properly document the power-of-2 requirement for SWB_NPAGES.

The patch is slightly different than the one Tor enclosed in the P.R.,
but accomplishes the same thing.

PR: kern/20273
Submitted by: Tor.Egge@fast.no


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bde's teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 0b441832 03-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Convert the vm_pager_strategy() interface to take a struct bio instead of
a struct buf. Don't try to examine B_ASYNC, it is a layering violation
to do so. The only current user of this interface is vn(4) which, since
it emulates a disk interface, operates on struct bio already.


# e4057dbd 01-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move and staticize the bufchain functions so they become local to the
only piece of code using them. This will ease a rewrite of them.


# 8177437d 14-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 25db2c54 27-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Add necessary spl protection for swapper. The problem was located by
Alfred while testing his SPLASSERT stuff. This is not a complete fix,
more protections are probably needed.


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# 912e4ae9 22-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Fix one place which knew that B_WRITE was zero.

Fix a stylistic mistake of mine while here.

Found by: Stephen Hocking <shocking@prth.pgs.com>


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.
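
The "exactly one bit set" rule for b_iocmd can be checked with a standard
power-of-two test. The sketch below is illustrative: the CMD_* values and
cmd_is_valid() are hypothetical names, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command values; each occupies its own bit, none is zero. */
#define CMD_READ	0x1u
#define CMD_WRITE	0x2u
#define CMD_DELETE	0x4u

/*
 * True when exactly one bit of cmd is set.  Clearing the lowest set bit
 * with cmd & (cmd - 1) leaves zero only if there was a single bit.
 */
static bool
cmd_is_valid(uint32_t cmd)
{
	return (cmd != 0 && (cmd & (cmd - 1)) == 0);
}
```

With no command encoded as zero, a forgotten or zeroed field fails the check
instead of silently reading as a write, which is the class of mistake the
commit attributes to B_WRITE having been defined as zero.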


# db5f635a 16-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# ea3aecf5 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Fix the swap backed vn case - this was broken by my rev 1.128 to
swap_pager.c and related commits.

Essentially swap_pager.c is backed out to before the changes, but
swapdev_vp is converted into a real vnode with just VOP_STRATEGY().
It no longer abuses specfs vnops and no longer needs a dev_t and
/dev/drum (or /dev/swapdev) for the intermediate layer.

This essentially restores the vnode interface as the interface to the
bottom of the swap pager, and vm_swap.c provides a clean vnode interface.

This will need to be revisited when we swap to files (vnodes) - which
is the other reason for keeping the vnode interface between the swap pager
and the swap devices.

OK'ed by: dillon


# 24e7ab7c 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Isolate the swapdev_vp "not quite" vnode in the only source file which
needs it now that /dev/drum is gone.

Reviewed by: eivind, peter


# cdacc6ab 17-Nov-1999 Peter Wemm <peter@FreeBSD.org>

Remove the non-functional "swap device" userland front-end to the
multiplexed underlying swap devices (/dev/drum). The only thing it did
was to allow root to open /dev/drum, but not do anything with it.
Various utilities used to grovel around in here, but Matt has written
a much nicer (and cleaner) front-end to this for libkvm, and nothing uses
the old system any more.

The VM system was calling VOP_STRATEGY() on the vp of the first underlying
swap device (not the /dev/drum one, the first real device), and using
the VOP system to indirectly (and only) call swstrategy() to choose
an underlying device and enqueue it on that device. I have changed it
to avoid diverting through the VOP system and to call the only possible
target directly, saving a little bit of time and some complexity.

In all, nothing much changes, except some scaffolding to support the
roundabout way of calling swstrategy() is gone.

Matt gave me the ok to do this some time ago, and I apologize for taking
so long to get around to it.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 4dcc5c2d 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix a number of spl bugs related to reserving and freeing swap space.
Swap space can be freed from an interrupt and so swap reservation and
freeing must occur at splvm.

Add swap_pager_reserve() code to support a new swap pre-reservation
capability for the VN device.

Generally cleanup the swap code by simplifying the swp_pager_meta_build()
static function and consolidating the SWAPBLK_NONE test from a bit test
to an absolute compare. The bit test was left over from a rejected
swap allocation scheme that was not ultimately committed. A few other
minor cleanups were also made.

Reorganize the swap strategy code, again for VN support, to not
reallocate swap when writing as this messes up pre-reservation and
can fragment I/O unnecessarily as VN-based disk is messed around with.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# af647dde 23-Aug-1999 Bruce Evans <bde@FreeBSD.org>

Use devtoname to print dev_t's instead of casting them to u_long for
misprinting with %lx.

Cast pointers to intptr_t instead of casting them to long. Cosmetic.


# c52e7044 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

Correct an accidental omission of one "vm_page_undirty" replacement
from the previous commit.


# 2c28a105 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


# 9b21395a 15-Jul-1999 Alan Cox <alc@FreeBSD.org>

Remove vm_object::last_read. It is used by the old swap pager, but
not by the new one, i.e., vm/swap_pager.c rev 1.108.

Reviewed by: dillon@backplane.com


# b890cb2c 27-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Kirk missed a required BUF_KERNPROC(). Even though this is a non-async
transfer, the b_iodone hook causes biodone() to release it from interrupt
context.


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# b0eeea20 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# a5296b05 14-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Submitted by: Matt Dillon <dillon@freebsd.org>
The old VN device broke in -4.x when the definition of B_PAGING
changed. This patch fixes this plus implements additional capabilities.
The new VN device can be backed by a file ( as per normal ), or it can
be directly backed by swap.

Due to dependencies in VM include files (on opt_xxx options) the new
vn device cannot be a module yet. This will be fixed in a later commit.
This commit is delimited by tags {PRE,POST}_MATT_VNDEV


# ad3cce20 21-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove conditional sysctl's

Leave swap_async_max sysctl intact, remove swap_cluster_max sysctl.

Reviewed by: Alan Cox <alc@cs.rice.edu>


# 20d3034f 21-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: Alan Cox <alc@cs.rice.edu>

Fix problem w/ low-swap/low-memory handling as reported by Bruce Evans.


# 327f4e83 18-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Limit number of simultaneous asynchronous swap pager I/Os that can
be in progress at any given moment.

Add two swap tuneables to sysctl:

vm.swap_async_max: 4
vm.swap_cluster_max: 16

Recommended values are a cluster size of 8 or 16 pages. async_max is
about right for 1-4 swap devices. Reduce to 2 if swap is eating too much
bandwidth, or even 1 if swap is both eating too much bandwidth and sitting
on a slow network (10BaseT).

The defaults work well across a broad range of configurations and should
normally be left alone.


# 2b0d37a4 06-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Add hysteresis to the 'swap_pager_getswapspace; failed' console message.
Also widen the hysteresis levels a little ( these really should be
dynamically configured ).
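
Illustration (not part of the original commit): a minimal sketch of
hysteresis applied to a warning message, assuming a low threshold that
triggers the warning and a higher one that re-arms it. The struct, function
names, and thresholds are hypothetical; the real swap pager code is not
shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for a low/high threshold warning with hysteresis. */
struct demo_hyst {
	unsigned long low;	/* enter the warning state below this */
	unsigned long high;	/* leave the warning state above this */
	bool warning;		/* currently in the warning state */
};

static void
demo_hyst_init(struct demo_hyst *h, unsigned long low, unsigned long high)
{
	assert(low < high);
	h->low = low;
	h->high = high;
	h->warning = false;
}

/*
 * Returns true only on the transition into the warning state, which is
 * the one moment a console message would be printed.
 */
static bool
demo_hyst_update(struct demo_hyst *h, unsigned long free_blocks)
{
	if (!h->warning && free_blocks < h->low) {
		h->warning = true;
		return (true);
	}
	if (h->warning && free_blocks > h->high)
		h->warning = false;
	return (false);
}
```

Because the warning is re-armed only after free space climbs above the high
threshold, a value hovering around the low threshold produces one message
rather than a stream of them. Widening the gap between the two thresholds
makes repeats rarer still.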


# 5e24f1a2 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove unintended trigraph sequences in comments for -Wall


# 7dbf82dc 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Change all manual settings of vm_page_t->dirty = VM_PAGE_BITS_ALL
to use the vm_page_dirty() inline.

The inline can thus do sanity checks ( or not ) over all cases.


# e4542174 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

vm_pager_put_pages() is passed an rcval array to hold per-page return
values. The 'int' return value for the procedure was never used and
not well defined in any case when there are mixed errors on pages, so
it has been removed. vm_pager_put_pages() and associated vm_pager
functions now return void.


# 9f6fed90 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

The default_pager's interaction with the swap_pager has been reorganized,
and the swap_pager has been completely replaced.

The new swap pager uses the new blist radix-tree based bitmap allocator
for low level swap allocation and deallocation. The new allocator
is effectively O(5) while the old one was O(N), and the new allocator
allocates all required memory at init time rather than allocating
memory on the fly at run time.

Swap metadata is allocated in clusters and stored in a hash table,
eliminating linearly allocated structures.

Many, many features have been rewritten or added. Swap space is now
reallocated on the fly providing a poor man's auto defragmentation of
swap space. Swap space that is no longer needed is freed on a timely
basis so no garbage collection is necessary.

Swap I/O is marked B_ASYNC and NFS has been fixed to do the right
thing with it, so NFS-based paging now has around 10x the performance
as it did before ( previously NFS enforced synchronous I/O for paging ).


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith
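
Illustration (not part of the original commit): a user-space sketch of the
"check + panic" shape described above, where the second macro argument
carries its own parentheses. DEMO_KASSERT and demo_panic() are hypothetical
stand-ins; the stand-in records the message instead of halting so the
behaviour can be observed.

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for panic(): records the formatted message instead of halting. */
static char demo_last_panic[128];

static void
demo_panic(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(demo_last_panic, sizeof(demo_last_panic), fmt, ap);
	va_end(ap);
}

/*
 * The message argument is written with its own parentheses, so
 * "demo_panic msg" expands to a complete call: demo_panic("fmt", args).
 */
#define DEMO_KASSERT(exp, msg) do {	\
	if (!(exp))			\
		demo_panic msg;		\
} while (0)

/* Example caller: checks its precondition before dividing. */
static int
demo_checked_div(int num, int den)
{
	DEMO_KASSERT(den != 0, ("demo_checked_div: zero divisor for %d", num));
	return (den != 0 ? num / den : 0);
}
```

The doubled parentheses let a pre-C99 macro forward a variable number of
printf-style arguments without variadic macro support.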


# 7a917245 29-Dec-1998 Dmitrij Tejblum <dt@FreeBSD.org>

Don't free swap in swap_pager_getpages(): this code probably causes the
"dying daemons" problem. (I thought this code was introduced in rev.1.80,
but it just relaxed the condition.)

Also, kill related "suggest more swap space" warning (also introduced in
1.80). It was confusing, to say the least...

Requested by: msmith
Not objected by: dg


# 04258de3 18-Nov-1998 Bruce Evans <bde@FreeBSD.org>

Fixed a null pointer panic in spc_free(). swap_pager_putpages()
almost always causes this panic for the curproc != pageproc case.
This case apparently doesn't happen in normal operation, but it
happens when vm_page_alloc_contig() is called when there is a memory
hogging application that hasn't already been paged out.

PR: 8632
Reviewed by: info@opensound.com (Dev Mazumdar), dg
Broken in: rev.1.89 (1998/02/23)


# 40c8cfe5 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.
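
Illustration (not part of the original commit): a sketch of how a narrowing
cast inside a page-rounding macro can scramble 64-bit offsets. The DEMO_*
macros are hypothetical; the 32-bit cast and the 4096-byte page size are
assumptions chosen for the example, not the kernel's actual definitions.

```c
#include <stdint.h>

#define DEMO_PAGE_MASK	((uint64_t)4095)	/* assumed 4096-byte pages */

/* Buggy form: the cast narrows the argument to 32 bits before rounding. */
#define DEMO_ROUND_PAGE_CAST(x) \
	((((uint32_t)(x)) + DEMO_PAGE_MASK) & ~DEMO_PAGE_MASK)

/*
 * Fixed form: no cast, so the arithmetic keeps the full width of the
 * argument. Callers passing other types must now cast explicitly.
 */
#define DEMO_ROUND_PAGE(x) \
	(((x) + DEMO_PAGE_MASK) & ~DEMO_PAGE_MASK)
```

For an offset of 0x100000001 (4 GiB plus one byte), the cast version yields
0x1000 while the uncast version yields 0x100001000. For small values the
two agree, which is one reason a bug like this can go unnoticed.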


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 196e9a52 13-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Protect all modifications to paging_in_progress with splvm().


# ccbbd927 28-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed two spl nesting bugs. They caused (at least) the entire pageout
daemon to run at splvm() forever after swap_pager_putpages() is called
from vm_pageout_scan().

Broken in: rev.1.89 (1998/02/23)
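
Illustration (not part of the original commit): one way a nested
save/restore of the priority level can go wrong, simulated in user space.
The commit does not describe the exact bugs, so the pattern below (reusing
a single variable for two nested saves) is an assumption, and every demo_*
name is hypothetical.

```c
/* Simulated priority levels; a higher value blocks more. */
enum { DEMO_SPL0 = 0, DEMO_SPLVM = 2 };

static int demo_cur_spl = DEMO_SPL0;

/* Raise to at least 'level' and return the previous level. */
static int
demo_splraise(int level)
{
	int old = demo_cur_spl;

	if (level > demo_cur_spl)
		demo_cur_spl = level;
	return (old);
}

/* Restore a previously saved level. */
static void
demo_splx(int saved)
{
	demo_cur_spl = saved;
}

/* Buggy nesting: the inner save overwrites the outer saved level. */
static void
demo_nested_buggy(void)
{
	int s;

	s = demo_splraise(DEMO_SPLVM);	/* s == original level */
	s = demo_splraise(DEMO_SPLVM);	/* s == DEMO_SPLVM, original lost */
	demo_splx(s);
	demo_splx(s);			/* restores DEMO_SPLVM, not original */
}

/* Fixed nesting: each save has its own variable, restored in reverse. */
static void
demo_nested_fixed(void)
{
	int outer, inner;

	outer = demo_splraise(DEMO_SPLVM);
	inner = demo_splraise(DEMO_SPLVM);
	demo_splx(inner);
	demo_splx(outer);
}
```

After demo_nested_buggy() the simulated level stays raised, so the caller
keeps running at the elevated level. After demo_nested_fixed() it returns
to the level it started at.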


# ac1e407b 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# fd5d1124 04-Jul-1998 Julian Elischer <julian@FreeBSD.org>

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


# cbd8ec09 03-May-1998 John Dyson <dyson@FreeBSD.org>

Work around some VM bugs, the worst being an overly aggressive
swap space free calculation. More complete fixes will be forthcoming,
in a week.


# c0877f10 28-Apr-1998 John Dyson <dyson@FreeBSD.org>

Tighten up management of memory and swap space during map allocation,
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.


# c1087c13 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Support compiling with `gcc -ansi'.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappiness."


# 66095752 24-Feb-1998 John Dyson <dyson@FreeBSD.org>

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# e47ed70b 23-Feb-1998 John Dyson <dyson@FreeBSD.org>

Significantly improve the efficiency of the swap pager, which appears to
have declined due to code-rot over time. The swap pager rundown code
has been cleaned up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overhead.)


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# e736cd05 02-Feb-1998 John Dyson <dyson@FreeBSD.org>

This fix should help the panic problems in -current. There
were some errors in "interval" management. Due to the
clustering mechanism, the code is necessarily complex and
error prone.


# 1f13bdaa 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix a performance problem caused by an earlier commit.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but not all subtle bugs are fixed yet.


# b44e4b7a 24-Dec-1997 John Dyson <dyson@FreeBSD.org>

Support running with inadequate swap space. Additionally, the code
will complain with a suggestion of increasing it.


# ab3f7469 02-Dec-1997 Poul-Henning Kamp <phk@FreeBSD.org>

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# dfeca1b8 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Print a device number in hex instead of decimal.


# b9dcd593 25-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Fixed type mismatches for functions with args of type vm_prot_t and/or
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 9b5a5d81 11-Jan-1997 John Dyson <dyson@FreeBSD.org>

Prepare better for multi-platform by eliminating another required
pmap routine (pmap_is_referenced.) Upper level recoded to use
pmap_ts_referenced.


# 8ba0c490 12-Oct-1996 Bruce Evans <bde@FreeBSD.org>

Removed __pure's and __pure2's. __pure is a no-op for recent versions
of gcc by definition, and __pure2 is a no-op in effect (presumably the
compiler can see when an inline function has no side effects).


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 3091ee09 09-Jun-1996 John Dyson <dyson@FreeBSD.org>

Mostly superficial code improvements, add a diagnostic. The
code improvements include significant simplification of the reservation
of the swap pager control blocks for reads. Add a panic for an inconsistent
swap pager control block count.


# 0a47b48b 22-May-1996 John Dyson <dyson@FreeBSD.org>

Initial support for MADV_FREE, support for pages that we don't care
about the contents anymore. This gives us a lot of the advantage of
freeing individual pages through munmap, but with almost none of the
overhead.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contains
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existing condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# aa8de40a 03-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


# e911eafc 02-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


# 45952afc 05-Mar-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem in the swap pager that caused some of the pages that
were paged in under low swap space conditions to both lose their
backing store and their dirty bits. This would cause pages to
be demand zeroed under certain conditions in low VM space conditions
and consequential sig-11's or sig-10's. This situation was made
worse lately when the level for swap space reclaim threshold was
increased.


# 836e5d13 03-Mar-1996 John Dyson <dyson@FreeBSD.org>

In order to fix some concurrency problems with the swap pager early
on in the FreeBSD development, I had made a global lock around the
rlist code. This was bogus, and now the lock is maintained on a
per resource list basis. This now allows the rlist code to be used for
almost any non-interrupt level application.


# de5f6a77 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# 1af87c92 31-Jan-1996 David Greenman <dg@FreeBSD.org>

"out of space" -> "out of swap space".


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# f2c6b65b 17-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Fixed 1TB filesize changes. Some pindexes had bogus names and types
but worked because vm_pindex_t is indistinguishable from vm_offset_t.


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# 87b6de2b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# cb6962cd 11-Dec-1995 John Dyson <dyson@FreeBSD.org>

Some new anti-deadlock code ended up messing up the paging stats. A modified
version of the code is now in place, and gausspage performance is back
up to where it should be.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# cac597e4 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# ff98689d 16-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Fixed recent staticizations. Some prototypes for static functions were
left in headers and not staticized.


# f5a12711 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

staticize.


# 23922cca 01-Nov-1995 David Greenman <dg@FreeBSD.org>

Move page fixups (pmap_clear_modify, etc) that happen after paging input
completes out of vm_fault and into the pagers. This gets rid of some
redundancy and improves the architecture.

Reviewed by: John Dyson <dyson>


# 2f82e604 23-Sep-1995 David Greenman <dg@FreeBSD.org>

Check that the swap block is valid before including it in a cluster.

Submitted by: John Dyson


# 894048d7 10-Sep-1995 John Dyson <dyson@FreeBSD.org>

Make sure that the prezero flag is cleared when needed.


# ca56715f 06-Sep-1995 John Dyson <dyson@FreeBSD.org>

Fixed a sign reversal problem -- might have caused some Sig-11s that
people have been seeing.


# 170db9c6 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Allow the fault code to use additional clustering info from both
bmap and the swap pager. Improved fault clustering performance.


# 2a4895f4 16-Jul-1995 David Greenman <dg@FreeBSD.org>

1) Merged swpager structure into vm_object.
2) Changed swap_pager internal interfaces to cope w/#1.
3) Eliminated object->copy as we no longer have copy objects.
4) Minor stylistic changes.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 5f55e841 17-May-1995 David Greenman <dg@FreeBSD.org>

Accessing pages beyond the end of a mapped file results in internal
inconsistencies in the VM system that eventually lead to a panic. These
changes fix the behavior to conform to the behavior in SunOS, which is
to deny faults to pages beyond the EOF (returning SIGBUS). Internally,
this is implemented by requiring faults to be within the object size
boundaries. These changes exposed another bug, namely that passing in
an offset to mmap when trying to map an unnamed anonymous region also
results in internal inconsistencies. In this case, the offset is forced
to zero.

Reviewed by: John Dyson and others


# a401ebbe 13-May-1995 David Greenman <dg@FreeBSD.org>

Changed swap partition handling/allocation so that it doesn't
require specific partitions be mentioned in the kernel config
file ("swap on foo" is now obsolete).

From Poul-Henning:

The visible effect is this:

As default, unless
options "NSWAPDEV=23"
is in your config, you will have four swap-devices.
You can swapon(2) any block device you feel like, it doesn't have
to be in the kernel config.

There is a performance/resource win available by getting the NSWAPDEV right
(but only if you have just one swap-device ??), but using that as default
would be too restrictive.

The invisible effect is that:

Swap-handling disappears from the $arch part of the kernel.
It gets a lot simpler (-145 lines) and cleaner.

Reviewed by: John Dyson, David Greenman
Submitted by: Poul-Henning Kamp, with minor changes by me.


# ee3a64c9 10-May-1995 David Greenman <dg@FreeBSD.org>

Changed "handle" from type caddr_t to void *; "handle" is several different
types of pointers, and "char *" is a bad choice for the type.


# 11fda60b 07-May-1995 John Dyson <dyson@FreeBSD.org>

Another error in the correction for trimming swap allocation for
small objects. (This code needs to be revisited.)


# 85b67b98 06-May-1995 John Dyson <dyson@FreeBSD.org>

Fixed a calculation that would once-in-a-while cause the swap_pager
to emit spurious page outside of object type messages. It is not
a fatal condition anyway, so the message will be omitted for
release. Also, the code that "clips" the allocation size, associated
with the above problem, was fixed.


# aba8f38e 19-Apr-1995 David Greenman <dg@FreeBSD.org>

New flag: B_PAGING. Added as part of the vn driver hack.


# 64abb5a5 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Removed obsolete/unused variable declarations.
Removed some extern declarations and included the correct include files.


# c3cb3e12 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# c419d77e 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a check for wrong object size; print a warning, but deal with it
correctly. The warning will tell us that there is a bug somewhere else
in sizing the object correctly.

Submitted by: John Dyson


# edf8a815 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed redundant newlines that were in some panic strings.


# 63635f5a 11-Mar-1995 David Greenman <dg@FreeBSD.org>

Clear OBJ_INTERNAL flag for device pager objects and named anonymous
objects.


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# c3a1e425 25-Feb-1995 David Greenman <dg@FreeBSD.org>

Fixed severely broken printf (arguments out of order, no newline).


# c0503609 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 7fb0c17e 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Deprecated remaining use of vm_deallocate. Deprecated
vm_allocate_with_pager(). Almost completely rewrote vm_mmap(); when John gets
done with the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# a1f6d91c 02-Feb-1995 David Greenman <dg@FreeBSD.org>

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# 480dff54 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed some formatting weirdness that I overlooked in the previous commit.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 7609ab12 22-Dec-1994 David Greenman <dg@FreeBSD.org>

Initialize b_vnbuf.le_next before returning a new buffer in getpbuf and
trypbuf. Move a couple of splbio's to be slightly less conservative.


# 6f7bc393 21-Dec-1994 David Greenman <dg@FreeBSD.org>

Fixed a benign off by one error.


# 61854083 18-Dec-1994 David Greenman <dg@FreeBSD.org>

Don't ever clear B_BUSY on a pbuf (or any other flag for that matter).
This appears to be the cause of some buffer confusion that leads to
a panic during heavy paging.

Submitted by: John Dyson


# 24ea4a96 13-Nov-1994 David Greenman <dg@FreeBSD.org>

Fixed bugs in accounting of swap space that resulted in the pager thinking
it was out of space when it really wasn't.

Submitted by: John Dyson


# a83c285c 06-Nov-1994 David Greenman <dg@FreeBSD.org>

Fixed return status from pagers. Ahem...the previous method would manufacture
data when it couldn't get it legitimately. :-(

Submitted by: John Dyson


# 1b119d9d 25-Oct-1994 David Greenman <dg@FreeBSD.org>

Improved I/O error reporting.


# 5663e6de 21-Oct-1994 David Greenman <dg@FreeBSD.org>

Various changes to allow operation without any swapspace configured. Note
that this is intended for use only in floppy situations and is done at
the sacrifice of performance in that case (in other words, this is not the
best solution, but works okay for this exceptional situation).

Submitted by: John Dyson


# 976e77fc 15-Oct-1994 David Greenman <dg@FreeBSD.org>

1) Some of the counters in the vmmeter struct don't fit well into the Mach VM
scheme of things, so I've changed them to be more appropriate. page in/outs
are now associated with the pager that did them. Nuked v_fault as the
only fault of interest that wouldn't be already counted in v_trap is a VM
fault, and this is counted separately.
2) Implemented most of the remaining counters and corrected the counting of
some that were done wrong. They are all almost correct now...just a few
minor ones left to fix.


# b73f3b1d 13-Oct-1994 David Greenman <dg@FreeBSD.org>

Got rid of redundant declaration warnings.


# 2e1e24dd 13-Oct-1994 David Greenman <dg@FreeBSD.org>

Fixed bug where page modifications would be lost when swap space was
almost depleted.

Reviewed by: John Dyson


# 35c10d22 09-Oct-1994 David Greenman <dg@FreeBSD.org>

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 426de760 24-Sep-1994 David Greenman <dg@FreeBSD.org>

Disabled swap anti-fragmentation code. It reduces swap paging performance
by 20% in my tests, and it appears to be the cause of a swap leak.

Submitted by: John Dyson


# fff93ab6 29-Aug-1994 David Greenman <dg@FreeBSD.org>

Patches from John Dyson to improve swap code efficiency.
Religiously add back pmap_clear_modify() in vnode_pager_input until we figure
out why system performance isn't what we expect.

Submitted by: John Dyson (swap_pager) & David Greenman (vnode_pager)


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# a481f200 07-Aug-1994 David Greenman <dg@FreeBSD.org>

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


# 16f62314 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and
pmap_kenter/pmap_kremove. These routines allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 03e6c253 01-Aug-1994 David Greenman <dg@FreeBSD.org>

Removed all code related to the pagescan daemon, and changed 'act_count'
adjustments to compensate for a world without the pagescan daemon.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources