History log of /freebsd-current/sys/vm/uma_int.h
Revision Date Author Comments
# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2760658b 02-May-2021 Alexander Motin <mav@FreeBSD.org>

Improve UMA cache reclamation.

When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
on a vm_lowmem event ZFS can free a few gigabytes of memory to UMA in one
call, but that does not mean it will request the same amount back just as
fast; in fact it won't.

Update the working set size on every reclamation call, shrinking caches faster
under pressure. The lack of this caused repeated vm_lowmem events that squeezed
more and more memory out of real consumers only to leave it stuck in UMA
caches. I saw ZFS drop its ARC size in half before the previous algorithm,
after a periodic WSS update, decided to reclaim the UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom, track a long-term minimum cache size watermark, freeing some unused
items every UMA_TIMEOUT after the first 15 minutes without cache misses. The
freed memory can get better use by other consumers. For example, ZFS won't
grow its ARC unless it sees free memory, since it does not know the memory is
not really used. And even if the memory is not really needed, periodic freeing
during inactivity periods should reduce its fragmentation.

Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
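
For illustration, a minimal sketch of the per-domain voluntary trimming idea
described above. The field and helper names (zdom_last_miss, zdom_nitems,
zdom_limin, zone_reclaim_items) are assumptions for the sketch, not the
actual uma_core.c code:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <vm/uma.h>
    #include <vm/uma_int.h>

    /*
     * Illustrative only: trim a zone domain's bucket cache toward its
     * long-term minimum watermark once enough time has passed without
     * a cache miss.  All names below are hypothetical.
     */
    static void
    zone_domain_trim_sketch(uma_zone_t zone, uma_zone_domain_t zdom)
    {
            long overage;

            /* Start voluntary trimming only after ~15 minutes without misses. */
            if (ticks - zdom->zdom_last_miss < 15 * 60 * hz)
                    return;

            /* Free a fraction of the items above the long-term minimum. */
            overage = zdom->zdom_nitems - zdom->zdom_limin;
            if (overage > 0)
                    zone_reclaim_items(zone, zdom, overage / 8);
    }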


# aabe13f1 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Introduce per-domain reclamation functions

Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.

Currently the new KPIs are unused, so there should be no functional
change.

Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685
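
For illustration, a hedged usage sketch of the new KPIs; the prototypes
(zone, request, domain) and (request, domain) are taken from the description
above and should be verified against uma.h:

    #include <vm/uma.h>

    /* Trim the bucket caches that belong to a single NUMA domain. */
    static void
    trim_domain_caches(uma_zone_t zone, int domain)
    {
            /* One zone's per-domain cache... */
            uma_zone_reclaim_domain(zone, UMA_RECLAIM_TRIM, domain);

            /* ...or every zone's cache for that domain. */
            uma_reclaim_domain(UMA_RECLAIM_TRIM, domain);
    }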


# d2f1c44b 27-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Remove the MINBUCKET flag from the flag name list

This should have been done in r368399 / commit
f8b6c51538fab88a7a62a399fb0948806b06133c.

Reported by: rlibby
Sponsored by: The FreeBSD Foundation


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# a2e19465 28-Aug-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

memstat_kvm_uma: fix reading of uma_zone_domain structures

Coverity flagged the scaling by sizeof(uzd). That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.

Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.

Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26213


# c8b0a88b 20-Jun-2020 Jeff Roberson <jeff@FreeBSD.org>

Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.


# 54007ce8 07-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Clean up uma_int.h a bit.

This makes it easier to write libkvm programs that access UMA data
structures.

- Remove a couple of unused slab functions and make others local to
uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of
slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL. There's no real
reason they can't be visible to userspace like the rest of UMA's
structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert.

No functional change intended.

Discussed with: jeff
Reviewed by: rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23980


# c6fd3e23 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for the bucket cache.

This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.

While here refactor the zone/domain separation and bucket limits to
simplify callers.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673


# 4ab3aee8 11-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Reduce lock hold time in keg_drain().

Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.

Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532


# bae55c4a 06-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: remove UMA_ZFLAG_CACHEONLY flag

UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516


# ec0d8280 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: add UMA_ZONE_CONTIG, and a default contig_alloc

For now, copy the mbuf allocator.

Reviewed by: jeff, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23237


# dc3915c8 03-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior
and this is more space efficient.

Stop queueing recently used buckets to the head of the list. If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493


# d4665eaa 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Implement a safe memory reclamation feature that is tightly coupled with UMA.

This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.

This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.

Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586


# 9b8db4d0 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: split slabzone into two sizes

By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149


# 4a8b575c 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: unify layout paths and improve efficiency

Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048


# 54c5ae80 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: reorganize flags

- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.

The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016


# 79c9f942 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix UMA boot pages calculations on NUMA machines that also don't have
UMA_MD_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure, but the specified alignment did not match
reality and involved redundant defines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046


# 31c251a0 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.

Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu


# dfe13344 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831


# 91d947bf 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists. This is a several-orders-of-magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket per-domain worth of memory.

Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830


# 8b987a77 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829


# 727c6918 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828


# 4bd61e19 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug for bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827


# cc7ce83a 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed only touching the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826


# 376b1ba3 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Optimize fast path allocations by storing bucket headers in the per-cpu
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825


# 815db204 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# d82c8ffb 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

Revert r355706 & r355710

The quick fix didn't work. I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"


# 7508f15f 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 6d204a6a 10-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: pretty print zone flags sysctl

Requested by: jeff
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22748


# 1e0701e1 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609


# 9b78b1f4 02-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by: markj, rlibby (prior version)
Differential Revision: https://reviews.freebsd.org/D22584


# 6d6a03d7 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22563


# 584061b4 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564


# 20a4e154 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a sysctl tree for uma zones to assist in debugging and provide
more statistics than are exported via the ABI-stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22554


# ca293436 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: trash memory when ctor/dtor supplied too

On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function. Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20722


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
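
For illustration, a brief sketch of the reclamation verbs; the request
constants are assumed to be spelled UMA_RECLAIM_TRIM, UMA_RECLAIM_DRAIN and
UMA_RECLAIM_DRAIN_CPU and should be confirmed against uma.h:

    #include <vm/uma.h>

    static void
    reclaim_examples(uma_zone_t zone)
    {
            /* Free only items in excess of the estimated working set. */
            uma_zone_reclaim(zone, UMA_RECLAIM_TRIM);

            /* Free the entire bucket cache of every zone. */
            uma_reclaim(UMA_RECLAIM_DRAIN);

            /* Additionally flush per-CPU caches; the most expensive option. */
            uma_reclaim(UMA_RECLAIM_DRAIN_CPU);
    }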


# c1685086 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.

Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929


# 6929b7d1 11-Feb-2019 Pedro F. Giffuni <pfg@FreeBSD.org>

UMA: unsign some variables related to allocation in hash_alloc().

As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.

While here, also unsign uh_hashmask; it makes little sense to keep it
signed.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148


# ad66f958 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that there is only one way to allocate a slab, remove uz_slab method.

Discussed with: jeff


# b68d692a 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Whitespace.


# 39669415 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix compilation failures on arches whose vm_machdep.c is not aware of
counter_u64_t, by including counter.h in uma_int.h. I'm not
happy about this inclusion, but it fixes compilation ASAP.


# 2efcc8cb 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces the amount of data protected by the zone lock.

During startup, point these fields at EARLY_COUNTER. After startup,
allocate them for all early zones.

Tested by: pho


# bb15d1c7 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

o Move the zone limit from the keg level up to the zone level. This means
that two zones sharing a keg may now have different limits. Now this is
going to work:

    zone = uma_zcreate();
    uma_zone_set_max(zone, limit);
    zone2 = uma_zsecond_create(zone);
    uma_zone_set_max(zone2, limit2);

Kegs no longer have a uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to the minimum possible CPU bucket cache size.
For small limits the bucket cache can also be reconfigured to be smaller.
The uz_items counter is updated whenever items transition from the keg to a
bucket cache or directly to a consumer. If a zone has uz_maxitems set and
it is reached, then we are going to sleep.

o Since the new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and it never
had a practical use. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or wherever, and the choice should be
controlled by the caller.

o The sleeping code is improved to account for the number of sleepers and
wake them one by one, to avoid the thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.

Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
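
For illustration, a sketch combining the zone-level limits with the new
uma_zone_set_maxcache() KPI; the uma_zcreate()/uma_zsecond_create() argument
lists below are from memory and should be checked against uma.h:

    #include <vm/uma.h>

    static void
    zone_limit_example(void)
    {
            uma_zone_t zone, zone2;

            zone = uma_zcreate("example", 128, NULL, NULL, NULL, NULL,
                UMA_ALIGN_PTR, 0);
            uma_zone_set_max(zone, 10000);          /* limit is in items */

            /* A second zone on the same keg may have its own limit. */
            zone2 = uma_zsecond_create("example2", NULL, NULL, NULL, NULL, zone);
            uma_zone_set_max(zone2, 500);

            /* Replacement for UMA_ZONE_NOBUCKETCACHE: no bucket cache at all. */
            uma_zone_set_maxcache(zone, 0);
    }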


# 3d5e3df7 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at pointer
size alignment. Calculation of the real struct uma_slab size is done in
keg_ctor() and yet again in keg_large_init(), to check if we need an extra
page. This calculation can actually be performed at compile time.

- Add a SIZEOF_UMA_SLAB macro to calculate the size of struct uma_slab placed
at the end of a page with the alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
a functional change.
- Use SIZEOF_UMA_SLAB in the UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since the compiler aligns struct uma_slab anyway.
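
For illustration, a hedged sketch of the compile-time idea; the real
SIZEOF_UMA_SLAB definition may differ, and this only shows the rounding of the
embedded header to pointer alignment:

    #include <sys/param.h>
    #include <vm/uma_int.h>

    /* Hypothetical spelling: embedded slab header size, pointer-aligned. */
    #define SIZEOF_UMA_SLAB_SKETCH  roundup(sizeof(struct uma_slab), sizeof(void *))
    /* Space left in the page for items when the header is embedded. */
    #define UMA_SLAB_SPACE_SKETCH   (PAGE_SIZE - SIZEOF_UMA_SLAB_SKETCH)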


# 0f9b7bf3 13-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Add accounting to per-domain UMA full bucket caches.

In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666


# 7571e249 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add an #include required after r339686.

X-MFC with: r339686
Sponsored by: The FreeBSD Foundation


# 194a979e 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Use a vm_domainset iterator in keg_fetch_slab().

Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420


# 306abf0f 24-Aug-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Either "free" or "allocated" is misleading here, since an item
in a bucket is free from perspective of UMA consumer, and it is
allocated from perspective of keg.

Discussed with: markj
Approved by: re (kib)


# a307fb5b 23-Aug-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix comment. The actual meaning of ub_cnt is the opposite.


# 63b5557b 23-Jun-2018 Jeff Roberson <jeff@FreeBSD.org>

Sort uma_zone fields according to 64 byte cache line with adjacent line
prefetch on 64bit architectures. Prior to this, two lines were needed
for the fast path and each line may fetch an unused adjacent neighbor.
- Move fields used by the fast path into a single line.
- Move constants into the adjacent line which is mostly used for
the spare bucket alloc 'medium path'.
- Unpad the mtx which is only used by the fast path and place it in
a line with rarely used data. This aligns the cachelines better and
eliminates 128 bytes of wasted space.

This gives a 45% improvement on a will-it-scale test on a 24 core machine.

Reviewed by: mmacy


# 12f69195 04-Jun-2018 Justin Hibbits <jhibbits@FreeBSD.org>

Align UMA data to 128 byte cacheline size

Suggested by: mjg


# 782e38aa 11-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: increase alignment to 128 bytes on amd64

Current UMA internals are not suited for efficient operation in
multi-socket environments. In particular there is very common use of
MAXCPU arrays and other fields which are not always properly aligned and
are not local for target threads (apart from the first node of course).
Turns out the existing UMA_ALIGN macro can be used to mostly work around
the problem until the code gets fixed. The current setting of 64 bytes
runs into trouble when the adjacent cache line prefetcher gets to work.

An example 128-way benchmark doing a lot of malloc/frees has the following
instruction samples:

before:
kernel`lf_advlockasync+0x43b 32940
kernel`malloc+0xe5 42380
kernel`bzero+0x19 47798
kernel`spinlock_exit+0x26 60423
kernel`0xffffffff80 78238
0x0 136947
kernel`uma_zfree_arg+0x46 159594
kernel`uma_zalloc_arg+0x672 180556
kernel`uma_zfree_arg+0x2a 459923
kernel`uma_zalloc_arg+0x5ec 489910

after:
kernel`bzero+0xd 46115
kernel`lf_advlockasync+0x25f 46134
kernel`lf_advlockasync+0x38a 49078
kernel`fget_unlocked+0xd1 49942
kernel`lf_advlockasync+0x43b 55392
kernel`copyin+0x4a 56963
kernel`bzero+0x19 81983
kernel`spinlock_exit+0x26 91889
kernel`0xffffffff80 136357
0x0 239424

See the review for more details.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D15346


# 5073a083 07-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix three miscalculations in amount of boot pages:

o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the number of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by: pho [1], jtl [2]


# f4bef67c 05-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. What is more important, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
startup_alloc(), however that may change, so vm_page_startup() provides
its need for early zones as an argument.
o Introduce the uma_startup_count() function, to avoid code duplication. The
function calculates the sizes of the zones zone and the kegs zone, and
calculates how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size use (mp_maxid + 1) instead
of mp_ncpus. Use the resulting number not only in the size argument to
zone_ctor() but also as args.size.

Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
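
For illustration, a hedged sketch of the _domain() variants; the argument
order is assumed and should be verified against uma.h and the malloc(9) of
this era:

    #include <sys/malloc.h>
    #include <vm/uma.h>

    static void
    alloc_on_domain(uma_zone_t zone, struct malloc_type *type, int domain)
    {
            void *item, *buf;

            /* Allocate one zone item and one malloc buffer from 'domain'. */
            item = uma_zalloc_domain(zone, NULL, domain, M_WAITOK);
            buf = malloc_domain(256, type, domain, M_WAITOK | M_ZERO);

            /* ... use them ... */

            free(buf, type);
            uma_zfree(zone, item);
    }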


# ad5b0f5b 01-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix the ZFS ARC after r326347 broke various memory limit queries. Use UMA
features rather than the kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494


# 2e47807c 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.

The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA. When
the soft limit is exceeded we periodically wake up the UMA reclaim thread to
attempt to shrink KVA. On 32-bit architectures this should behave much more
gracefully as we exhaust KVA. On 64-bit the limits are likely never hit.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the tool I
was using misidentified many licenses, so this was mostly a manual, error-prone
task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# e04223bf 15-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Include _bitset.h to get BITSET_DEFINE, used to define struct slabbits.

MFC after: 1 week


# 2d54d4bb 13-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Widen uk_pgoff, the slab header offset field.

16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.

We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.

PR: 218911
MFC after: 2 weeks


# a55ebb7c 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: eliminate uk_slabsize field

The field was not used beyond the initial keg setup stage anyway.

MFC after: 1 month (if ever)


# 34caa842 07-Jul-2016 Colin Percival <cperciva@FreeBSD.org>

Autotune the number of pages set aside for UMA startup based on the number
of CPUs present. On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64). Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger. It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on: EC2 x1.32xlarge
Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring
vm.boot_pages to be manually adjusted.
Reviewed by: jeff, alc, adrian
Approved by: re (kib)


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# cfcae3f8 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Remove UMA_ZONE_REFCNT feature, now unused.

Blessed by: jeff


# b28cc462 09-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by: jhb


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# 54503a13 19-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add a safety net to reclaim mbufs when one of the mbuf zones becomes
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
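
For illustration, a hedged sketch of the new hook from item (a); the setter
name uma_zone_set_maxaction() is inferred from the uz_maxaction field
mentioned in a later entry, and the callback prototype is an assumption:

    #include <sys/param.h>
    #include <sys/taskqueue.h>
    #include <vm/uma.h>

    static struct task reclaim_task;    /* set up elsewhere with TASK_INIT() */

    /* Called when an allocation fails because the zone limit was reached. */
    static void
    zone_full_cb(uma_zone_t zone __unused, int flags __unused)
    {
            /* Minimal work only: defer the real reclamation. */
            taskqueue_enqueue(taskqueue_thread, &reclaim_task);
    }

    static void
    register_maxaction(uma_zone_t zone)
    {
            uma_zone_set_maxaction(zone, zone_full_cb);     /* assumed prototype */
    }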


# 43ffa928 24-Apr-2015 Scott Long <scottl@FreeBSD.org>

Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included. The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from: Netflix, Inc.
MFC after: immediately


# 51cfb0be 12-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Rework r281162. Indeed, the flexible array member is preferable here.

Suggested by: Justin T. Gibbs

MFC after: 3 days


# f2c2231e 31-Mar-2015 Ryan Stone <rstone@FreeBSD.org>

Fix integer truncation bug in malloc(9)

A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks


# ace66b56 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement soft pressure on UMA cache bucket sizes.

Every time the system detects a low memory condition, decrease the bucket size
for each zone by one item. As a result, higher memory pressure will push toward
smaller bucket sizes, and so smaller per-CPU caches and more efficient
memory use.

Before this change there was no force opposing bucket growth resulting
from the practically inevitable zone lock conflicts, and after some run time
the per-CPU caches could consume enough RAM to kill the system.


# 9eab5484 17-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of struct vm_page abuse the pageq member to keep
additional information when the page is guaranteed not to belong to a
paging queue. Usually, this results in a lot of type casts, which make
reasoning about the correctness of the code harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# dab12c75 24-Jul-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Since r251709 a slab no longer uses 8-bit indices to manage items,
thus remove a stale comment.

Reviewed by: jeff


# 6fd34d6f 25-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. buckets will never be filled
from reserve allocations.

Sponsored by: EMC / Isilon Storage Division
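
For illustration, a sketch of pairing uma_zone_reserve() with M_USE_RESERVE;
struct io_request and the zone name are hypothetical:

    #include <sys/malloc.h>
    #include <vm/uma.h>

    struct io_request { int ior_op; char ior_data[120]; };     /* hypothetical */

    static uma_zone_t io_zone;

    static void
    io_zone_init(void)
    {
            io_zone = uma_zcreate("io requests", sizeof(struct io_request),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            /* Hold 32 items aside; only M_USE_RESERVE callers may use them. */
            uma_zone_reserve(io_zone, 32);
    }

    static struct io_request *
    io_request_alloc_critical(void)
    {
            /* May be satisfied from the reserve when the zone is depleted. */
            return (uma_zalloc(io_zone, M_NOWAIT | M_USE_RESERVE));
    }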


# af526374 20-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address-of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# fc03d22b 17-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 0095a784 16-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.

Sponsored by: EMC / Isilon Storage Division
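
For illustration, a hedged sketch of a keg-less cache zone; the
uma_zcache_create() parameter order, the import/release callback signatures,
and the backing_alloc()/backing_free() names are assumptions:

    #include <vm/uma.h>

    /* Hypothetical backing store, defined elsewhere. */
    void *backing_alloc(void *arg, int flags);
    void backing_free(void *arg, void *item);

    /* Import fills 'store' with up to 'count' items from the backing store. */
    static int
    tag_import(void *arg, void **store, int count, int flags)
    {
            int i;

            for (i = 0; i < count; i++)
                    if ((store[i] = backing_alloc(arg, flags)) == NULL)
                            break;
            return (i);
    }

    /* Release hands cached items back to the backing store. */
    static void
    tag_release(void *arg, void **store, int count)
    {
            int i;

            for (i = 0; i < count; i++)
                    backing_free(arg, store[i]);
    }

    static uma_zone_t
    tag_zone_create(void *backend)
    {
            /* 64 is an arbitrary item size for the sketch. */
            return (uma_zcache_create("tags", 64, NULL, NULL,
                NULL, NULL, tag_import, tag_release, backend, 0));
    }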


# ef72505e 13-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 85dcf349 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Convert UMA code to C99 uintXX_t types.


# 04fc5741 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by: alc


# 8cf455b8 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Since we now support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but the growth isn't
significant. Taking kmem zones as an example, only the 32-byte
zone is affected; ipers is reduced from 113 to 112.

In collaboration with: kib


# ad97af7e 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from the VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such a
zone has a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance will allow us to make some optimizations later.

To address a CPU's private copy of an item, simple arithmetic is used:

    item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
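
For illustration, a tiny sketch of a per-CPU counter built on such a zone;
the zone name and the exact uma_zcreate() argument list are assumptions:

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/pcpu.h>
    #include <vm/uma.h>

    static uma_zone_t pcpu64_zone;
    static uint64_t *stat_base;

    static void
    stat_init(void)
    {
            pcpu64_zone = uma_zcreate("pcpu-64", sizeof(uint64_t),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
            stat_base = uma_zalloc(pcpu64_zone, M_WAITOK);
    }

    static void
    stat_bump(void)
    {
            /* zpcpu_get() applies the sizeof(struct pcpu) * curcpu offset. */
            (*zpcpu_get(stat_base))++;
    }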

To introduce non-page-size slabs, a new field, uk_slabsize, has been added to
uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, the uma_keg fields
were rearranged a bit and the least frequently used uk_name and uk_link were
moved down to the fourth cache line. All other frequently dereferenced
fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater for the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater for the backend
allocator.
- uma_zone_reserve_kva() can cater for M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses the direct mapping
via uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide
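
For illustration, a minimal sketch of the replacement call; the return value
convention of uma_zone_reserve_kva() is not checked here and the zone shown
is hypothetical:

    #include <vm/uma.h>

    static uma_zone_t kvaitem_zone;

    static void
    kvaitem_zone_init(int maxitems)
    {
            kvaitem_zone = uma_zcreate("kva items", 256, NULL, NULL, NULL, NULL,
                UMA_ALIGN_PTR, 0);
            /* Reserve KVA up front instead of passing a VM object. */
            (void)uma_zone_reserve_kva(kvaitem_zone, maxitems);
    }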


# 936c747b 21-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Comment fix: there is no ub_ptr; instead, explain the meaning of the uz_count
field verbally.


# 2f891cd5 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
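
For illustration, a one-call sketch; the prototype is assumed to take the
zone and a constant warning string, as described in uma_zone_set_warning(9):

    #include <vm/uma.h>

    static void
    clust_zone_set_warning(uma_zone_t zone)
    {
            /* Printed at most once every five minutes when the zone is full. */
            uma_zone_set_warning(zone,
                "Increase kern.ipc.nmbclusters: zone limit reached");
    }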


# bb196eb4 26-Oct-2012 Matthew D Fleming <mdf@FreeBSD.org>

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 342f1793 21-May-2011 Alan Cox <alc@FreeBSD.org>

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# 59d7277f 20-May-2011 Alan Cox <alc@FreeBSD.org>

Fix spelling errors.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# bf965959 15-Jun-2010 Sean Bruno <sbruno@FreeBSD.org>

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 1a23373c 22-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

- enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone


# 6b4391d7 18-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

turn 205266 in to a no-op until the problem can be properly diagnosed


# 5e4bb93c 17-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

Cache line align various structures and move volatile counters to
not share a cache line with (mostly) immutable state

Reviewed by: jeff@
MFC after: 7 days


# f12d6d2a 07-Jan-2010 Antoine Brodin <antoine@FreeBSD.org>

MFC r200129 to stable/8:
Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.


# 4e2d83fc 05-Dec-2009 Antoine Brodin <antoine@FreeBSD.org>

Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.

MFC after: 1 month


# e20a199f 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 6ab3b958 09-May-2007 Robert Watson <rwatson@FreeBSD.org>

Update stale comment on protecting UMA per-CPU caches: we now use
critical sections rather than mutexes.


# af17e9a9 04-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be
used from memstat_uma.c for the purposes of kvm access without lots
of additional unsafe includes.

MFC after: 3 days


# 08ecce74 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 2018f30c 15-Jul-2005 Mike Silbersack <silby@FreeBSD.org>

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 2019094a 15-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 7a52a97e 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user-space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week
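
For illustration, a hedged user-space sketch of consuming the stream; the
header field names below are assumptions based on the description (version,
MAXCPU, record count), and the real layouts live in <vm/uma.h>:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed minimal view of the stream header described above. */
    struct zstat_header_sketch {
            uint32_t        version;
            uint32_t        maxcpus;
            uint32_t        count;
    };

    int
    main(void)
    {
            struct zstat_header_sketch *hdr;
            size_t len = 0;
            char *buf;

            /* Size the buffer, then fetch the whole binary stream. */
            if (sysctlbyname("vm.zone_stats", NULL, &len, NULL, 0) != 0)
                    err(1, "sysctlbyname size");
            if ((buf = malloc(len)) == NULL)
                    err(1, "malloc");
            if (sysctlbyname("vm.zone_stats", buf, &len, NULL, 0) != 0)
                    err(1, "sysctlbyname read");

            hdr = (struct zstat_header_sketch *)buf;
            printf("stream v%u, %u CPUs, %u zones\n",
                hdr->version, hdr->maxcpus, hdr->count);
            /* Each record: one type header followed by maxcpus per-CPU structs. */
            free(buf);
            return (0);
    }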


# 773df9ab 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# eafc7b54 16-Jun-2005 Alan Cox <alc@FreeBSD.org>

Increase UMA_BOOT_PAGES to prevent a crash during initialization. See
http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed
description of the crash.

Reported by: Eric Anderson
Approved by: re (scottl)
MFC after: 3 days


# 5d1ae027 29-Apr-2005 Robert Watson <rwatson@FreeBSD.org>

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others


# 8076cb52 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 7b871205 25-Dec-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# 6fc96493 26-Nov-2004 Olivier Houchard <cognet@FreeBSD.org>

Remove useless casts.


# 244f4554 29-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportioned to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers, we first find the largest objectsize for which
we are guaranteed not to go offpage, and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), we are guaranteed that the computed
ipers is always >= what we will ever need in offpage slab
headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
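
The sketch referenced above: a simplified version of the wasted-space
decision (variable names, constants, and rounding are illustrative, not
the actual zone_small_init() code):

    objsize = item_size + UMA_FRITM_SZ;     /* item plus its freelist index */
    ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) / objsize;
    wastedspace = UMA_SLAB_SIZE - sizeof(struct uma_slab) - ipers * objsize;
    if (wastedspace >= UMA_MAX_WASTE) {
            /* An in-page header would waste too much space; go off-page. */
            flags |= UMA_ZONE_OFFPAGE;
            ipers = UMA_SLAB_SIZE / objsize;        /* whole page holds items */
    }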


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them at the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page and looks up the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change certain code paths that used to do m_get() + m_clget()
to instead use m_getcl() and take advantage of the newly
defined secondary Packet zone (see the sketch at the end of
this entry).
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
The latest report is that ips performs equally poorly with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper detailing the change and considering various
performance issues is available; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)
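
The m_get() + m_clget() -> m_getcl() conversion mentioned above looks
roughly like this (a generic sketch, not a specific converted call site;
error handling trimmed):

    struct mbuf *m;

    /* Before: allocate the mbuf and its cluster separately. */
    m = m_get(M_DONTWAIT, MT_DATA);
    if (m != NULL) {
            m_clget(m, M_DONTWAIT);
            if ((m->m_flags & M_EXT) == 0) {
                    m_free(m);
                    m = NULL;
            }
    }

    /* After: one call, served from the secondary Packet zone. */
    m = m_getcl(M_DONTWAIT, MT_DATA, 0);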


# c19aa340 17-Jan-2004 Alan Cox <alc@FreeBSD.org>

Increase UMA_BOOT_PAGES because of changes to pv entry initialization in
revision 1.457 of i386/i386/pmap.c.


# 925692ca 21-Dec-2003 Alan Cox <alc@FreeBSD.org>

- Significantly reduce the number of preallocated pv entries in
pmap_init(). Such a large preallocation is unnecessary and wastes
nearly eight megabytes of kernel virtual address space per gigabyte
of managed physical memory.
- Increase UMA_BOOT_PAGES by two. This enables the removal of
pmap_pv_allocf(). (Note: this function was only used during
initialization, specifically, after pmap_init() but before
pmap_init2(). During pmap_init2(), a new allocator is installed.)


# 9643769a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 3e0cab95 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# b60f5b79 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot themselves in the foot.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destroyed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# cae33c14 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.
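
For reference, the fields renamed and added above describe a bucket laid
out roughly as follows (a sketch, not necessarily the exact struct
uma_bucket definition):

    struct uma_bucket {
            LIST_ENTRY(uma_bucket)  ub_link;        /* linkage for bucket lists */
            int16_t                 ub_cnt;         /* count of free items held */
            int16_t                 ub_entries;     /* entries the bucket can hold */
            void                    *ub_bucket[];   /* the cached items */
    };

    /* ub_cnt counts free items, so popping one needs no off-by-one math: */
    item = bucket->ub_bucket[--bucket->ub_cnt];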


# 20e8e865 11-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE
(see the sketch at the end of this entry). Add a KASSERT in
zone_small_init to make sure that the computed ipers (items per slab)
for the zone is not zero; even with the new check in place, this is
just an extra safeguard (this part submitted by: silby).

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
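
The sketch referenced in the first item above, showing the new comparison
(illustrative, not the exact uma_core.c test):

    /*
     * An item, plus its one byte of freelist linkage, must fit next to
     * an in-page slab header to qualify for zone_small_init().
     */
    if (zone->uz_size + 1 > UMA_SLAB_SIZE - sizeof(struct uma_slab))
            zone_large_init(zone);
    else
            zone_small_init(zone);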


# f828e5be 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# d88797c2 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# c5d771b8 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Prepend _ to internal union members to avoid ambiguity.

Found by: FlexeLint


# 48eea375 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add support for machine-dependent page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake


# f461cf22 19-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 99571dc3 18-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
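
Illustrative helpers for the scheme described above (hypothetical names
and bodies; the real inlines in uma_int.h may differ):

    static __inline void
    vsetslab_sketch(vm_offset_t va, uma_slab_t slab)
    {
            vm_page_t p;

            /* Overload the page's object pointer to remember its slab. */
            p = PHYS_TO_VM_PAGE(pmap_kextract(va));
            p->object = (vm_object_t)slab;
    }

    static __inline uma_slab_t
    vtoslab_sketch(vm_offset_t va)
    {
            vm_page_t p;

            p = PHYS_TO_VM_PAGE(pmap_kextract(va));
            return ((uma_slab_t)p->object);
    }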


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
To come: ia64 and power-pc patches, patches for gdb, test program (in tools).

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
Expect slight instability in signals.


# 18aa2de5 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM, which sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
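
An illustrative zone creation using the new flag (the zone name and the
other arguments here are examples only, not quotes of the actual diff):

    mapentzone = uma_zcreate("MAP ENTRY", sizeof(struct vm_map_entry),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);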


# 28bc4419 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# af7f9b97 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
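
A rough sketch of the new blocking behavior (field and variable names are
illustrative, not the actual zone code):

    ZONE_LOCK(zone);
    while (zone->uz_pages >= zone->uz_maxpages) {
            if ((flags & M_WAITOK) == 0) {
                    /* M_NOWAIT callers still just fail at the limit. */
                    ZONE_UNLOCK(zone);
                    return (NULL);
            }
            /* M_WAITOK callers sleep until an item is freed back. */
            msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
    }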


# 1d4cb54b 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Quiet witness warnings about acquiring several zone locks. In the cases
where this happens, it is OK.


# a553d4b8 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# c235bfa5 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Spelling correction; s/seperate/separate/g

Submitted by: eric


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
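
Illustrative calls showing the lock type name argument (these call sites
are examples, not quotes of the actual diff):

    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);
    mtx_init(&sc->sc_mtx, device_get_nameunit(sc->sc_dev), MTX_NETWORK_LOCK,
        MTX_DEF);
    mtx_init(&foo_mtx, "foo", NULL, MTX_DEF);       /* the common case */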


# f22a4b62 27-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab-like allocator.

Reviewed by: arch@