History log of /freebsd-10.0-release/sys/vm/uma_core.c
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 258911 04-Dec-2013 rodrigc

MFC r258737

In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian
Approved by: re (gjb)


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255097 31-Aug-2013 mckusick

Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by: pho


# 254182 10-Aug-2013 kib

Different consumers of the struct vm_page abuse the pageq member to keep
additional information, when the page is guaranteed not to belong to a
paging queue. Usually, this results in a lot of type casts, which make
reasoning about the code's correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 253565 23-Jul-2013 glebius

Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU. This
allows us to init the counter zone at an early stage of boot.

Reviewed by: kib
Tested by: Lytochkin Boris <lytboris gmail.com>


# 252358 28-Jun-2013 davide

Remove a spurious keg lock acquisition.


# 252226 25-Jun-2013 jeff

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. Buckets will never be filled
from reserve allocations. A usage sketch follows below.
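
To make the reserve mechanism concrete, here is a minimal hedged sketch
(the zone, item type and reserve count are hypothetical; uma_zone_reserve()
and M_USE_RESERVE are the interfaces from this commit):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    struct crit_item {			/* hypothetical item type */
        int ci_dummy;
    };

    static uma_zone_t crit_zone;	/* hypothetical zone */

    static void
    crit_zone_init(void)
    {
        crit_zone = uma_zcreate("crit", sizeof(struct crit_item),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
        /* Hold 32 items aside; buckets are never filled from these. */
        uma_zone_reserve(crit_zone, 32);
    }

    /* In a context that must not fail even under memory pressure: */
    static struct crit_item *
    crit_item_alloc(void)
    {
        return (uma_zalloc(crit_zone, M_NOWAIT | M_USE_RESERVE));
    }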

Sponsored by: EMC / Isilon Storage Division


# 252040 20-Jun-2013 jeff

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# 251983 19-Jun-2013 jeff

- Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by: EMC / Isilon Storage Division


# 251894 18-Jun-2013 jeff

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock; this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree; this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 251826 17-Jun-2013 jeff

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.
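
As a hedged sketch of the import/release contract (the backend helpers,
names and item size are hypothetical, and the prototype shown includes the
size argument that r252040 above added later):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /* Hypothetical backing store feeding a cache-only zone. */
    static int
    hypo_import(void *arg, void **store, int count, int flags)
    {
        int i;

        for (i = 0; i < count; i++)
            if ((store[i] = hypo_backend_alloc(arg, flags)) == NULL)
                break;
        return (i);		/* number of items actually imported */
    }

    static void
    hypo_release(void *arg, void **store, int count)
    {
        int i;

        for (i = 0; i < count; i++)
            hypo_backend_free(arg, store[i]);
    }

    static uma_zone_t hypo_cache;

    static void
    hypo_cache_init(void)
    {
        hypo_cache = uma_zcache_create("hypocache", HYPO_ITEM_SIZE,
            NULL, NULL, NULL, NULL, hypo_import, hypo_release,
            NULL, 0);
    }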

Sponsored by: EMC / Isilon Storage Division


# 251709 13-Jun-2013 jeff

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 249763 22-Apr-2013 glebius

Panic if a UMA_ZONE_PCPU zone is created at an early stage of boot, when
mp_ncpus isn't yet initialized. Otherwise we will panic at the first
allocation later.

Sponsored by: Nginx, Inc.


# 249313 09-Apr-2013 glebius

Convert UMA code to C99 uintXX_t types.


# 249305 09-Apr-2013 glebius

Fix KASSERTs: maximum number of items per slab is 256.


# 249264 08-Apr-2013 glebius

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
a zone would have a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance would allow us to make some optimizations later.

To address a private item from a CPU, simple arithmetic should be used:

item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
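
As an illustration, a minimal hedged sketch of a consumer (the zone and
counter names are hypothetical; UMA_ZONE_PCPU and zpcpu_get() are the
interfaces described above):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/pcpu.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static uma_zone_t pcpu_zone;	/* hypothetical zone */
    static uint64_t *pktcnt;		/* base address; one twin per CPU */

    static void
    pktcnt_init(void)
    {
        pcpu_zone = uma_zcreate("pcpu-u64", sizeof(uint64_t),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
        pktcnt = uma_zalloc(pcpu_zone, M_WAITOK);
    }

    static void
    pktcnt_inc(void)
    {
        critical_enter();		/* pin to the current CPU */
        (*zpcpu_get(pktcnt))++;		/* the arithmetic shown above */
        critical_exit();
    }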

To introduce non-page-size slabs, a new field, uk_slabsize, has been added
to uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, the uma_keg
fields were rearranged a bit, and the least frequently used uk_name and
uk_link were moved down to the fourth cache line. All other frequently
dereferenced fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The avoidance of namespace pollution in vm/vm_pager.h (which forced
consumers to include sys/mutex.h directly to cater to its inline
functions using VM_OBJECT_LOCK()) means that all vm/vm_pager.h
consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks in
the compat layer because the name clash between the FreeBSD and solaris
versions must be avoided.
For this purpose, zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 247360 26-Feb-2013 attilio

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater to the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater to the backend
allocator.
- uma_zone_reserve_kva() can cater to M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() directly uses the direct mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().
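
Stepping back to the new primitive itself, a hedged usage sketch (the
zone, item count and failure handling are hypothetical):

    #include <sys/systm.h>
    #include <vm/uma.h>

    static void
    hypo_zone_reserve(uma_zone_t hypo_zone)
    {
        /* Reserve KVA up front so 4096 items can always be mapped. */
        if (uma_zone_reserve_kva(hypo_zone, 4096) == 0)
            panic("hypo: cannot reserve KVA");
        uma_prealloc(hypo_zone, 4096);	/* optionally prefault pages */
    }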

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide


# 246087 29-Jan-2013 glebius

Fix typo in debug printf.


# 243998 07-Dec-2012 pjd

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
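
For illustration, a minimal hedged sketch of a consumer (the zone, its
size and limit are hypothetical):

    #include <sys/param.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static void
    hypo_zone_init(void)
    {
        hypo_zone = uma_zcreate("hypo", 128, NULL, NULL, NULL, NULL,
            UMA_ALIGN_PTR, 0);
        uma_zone_set_max(hypo_zone, 1024);
        /* Printed at most once every five minutes once the zone is full: */
        uma_zone_set_warning(hypo_zone, "hypo zone limit reached");
    }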

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 242152 26-Oct-2012 mdf

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 241825 22-Oct-2012 eadler

Print flags as hex instead of an integer.

PR: kern/168210
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 240676 18-Sep-2012 glebius

If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by: jeff


# 239710 26-Aug-2012 glebius

Fix function name in keg_cachespread_init() assert.


# 238206 07-Jul-2012 eadler

Add missing sleep stat increase

PR: kern/168211
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 238000 02-Jul-2012 jhb

Honor db_pager_quit in 'show uma' and 'show malloc'.

MFC after: 1 month


# 235854 23-May-2012 emax

Tweak condition for disabling allocation from per-CPU buckets in
low memory situation. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.

Reviewed by: alc
MFC after: 1 week


# 230623 27-Jan-2012 kmacy

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA ones, now entails the addition
of a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


# 226313 12-Oct-2011 glebius

Make memguard(9) capable of guarding uma(9) allocations.


# 222184 22-May-2011 alc

Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by: jh


# 222163 21-May-2011 alc

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# 222132 20-May-2011 alc

Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)


# 219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 217916 26-Jan-2011 mdf

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# 214782 04-Nov-2010 jhb

Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by: trema
Reviewed by: attilio
MFC after: 1 week


# 214062 19-Oct-2010 mdf

uma_zfree(zone, NULL) should do nothing, to match free(9).

Noticed by: Ron Steinke <rsteinke at isilon dot com>
MFC after: 3 days


# 213911 16-Oct-2010 lstewart

Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.
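
A hedged sketch of the resulting single-call idiom (the zone, limit and
message are hypothetical):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static void
    hypo_set_limit(uma_zone_t zone)
    {
        int eff;

        eff = uma_zone_set_max(zone, 1000);	/* returns effective limit */
        if (eff != 1000)
            printf("zone limit rounded up to %d items\n", eff);
    }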

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb


# 213910 16-Oct-2010 lstewart

- Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.
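
A hedged sketch of the sysctl use case mentioned above (the handler and
zone names are hypothetical):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static int
    sysctl_hypo_inuse(SYSCTL_HANDLER_ARGS)
    {
        int cur;

        cur = uma_zone_get_cur(hypo_zone);	/* approximate occupancy */
        return (sysctl_handle_int(oidp, &cur, 0, req));
    }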

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks


# 212750 16-Sep-2010 mdf

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 212572 13-Sep-2010 mdf

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# 212370 09-Sep-2010 mdf

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# 211396 16-Aug-2010 andre

Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally). This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value. The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.
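
A hedged sketch of the two-call idiom this commit enables (the helper
name is hypothetical):

    #include <vm/uma.h>

    static int
    hypo_apply_limit(uma_zone_t zone, int limit)
    {

        uma_zone_set_max(zone, limit);		/* rounded up internally */
        return (uma_zone_get_max(zone));	/* the effective limit */
    }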

Reviewed by: jeffr
MFC after: 1 week


# 209215 15-Jun-2010 sbruno

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().


# 207576 03-May-2010 alc

It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages. Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by: kib


# 207410 29-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 201145 28-Dec-2009 antoine

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
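
A minimal sketch of the correct usage (struct foo is hypothetical):

    #include <sys/queue.h>

    struct foo {
        SLIST_ENTRY(foo) f_link;
    };

    /*
     * Correct: the argument is the head variable itself, not the type
     * (the expansion ignores it, hence no change to generated binaries).
     */
    static SLIST_HEAD(foohead, foo) foo_head =
        SLIST_HEAD_INITIALIZER(foo_head);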

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


# 194429 18-Jun-2009 alc

Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)


# 187681 25-Jan-2009 jeff

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# 182047 23-Aug-2008 antoine

Remove unused variable nosleepwithlocks.

PR: 126609
Submitted by: Mateusz Guzik
MFC after: 1 month
X-MFC: to stable/7 only, this variable is still used in stable/6


# 182028 22-Aug-2008 nwhitehorn

Allow the MD UMA allocator to use VM routines like kmem_*(). Existing
code requires MD allocator to be available early in the boot process,
before the VM is fully available. This defines a new VM define
(UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to
become available at the same time as the default UMA allocator.

Approved by: marcel (mentor)


# 177921 04-Apr-2008 alc

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


# 172545 11-Oct-2007 jhb

Allow recursion on the 'zones' internal UMA zone.

Submitted by: thompsa
MFC after: 1 week
Approved by: re (kensmith)
Discussed with: jeff


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 166654 11-Feb-2007 rwatson

Add uma_set_align() interface, which will be called at most once during
boot by MD code to indicate a detected alignment preference. Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment. If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired. This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.
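
A hedged sketch of what such MD startup code might look like (the hook
name and cache line size are hypothetical):

    #include <vm/uma.h>

    /* Must run before UMA is initialized so all zones pick it up. */
    static void
    hypo_md_set_align(void)
    {
        uma_set_align(64 - 1);		/* e.g. 64-byte cache lines */
    }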

Reviewed by: jeff, alc
Submitted by: attilio


# 166213 24-Jan-2007 mohans

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


# 166211 24-Jan-2007 mohans

Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wake up all such blocked
processes instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewed by: ups@


# 165928 10-Jan-2007 rwatson

Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call. This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after: 3 weeks
Discussed with: jhb


# 165809 05-Jan-2007 jhb

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.
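
A hedged sketch of the kind of check described (the zone and message are
hypothetical):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static void
    hypo_zone_check(void)
    {
        if (uma_zone_exhausted(hypo_zone))
            printf("WARNING: hypo zone exhausted, "
                "allocations may block or fail\n");
    }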

MFC after: 1 week
Reviewed by: alc


# 163702 26-Oct-2006 rwatson

Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after: 3 days


# 160460 17-Jul-2006 rwatson

Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>


# 160414 16-Jul-2006 rwatson

Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.


# 158803 21-May-2006 rwatson

When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type. The failure
will be separately recorded with the bucket type. This may eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after: 2 weeks
Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>,
OxY <oxy at field dot hu>,
Gabor MICSKO <gmicskoa at szintezis dot hu>


# 155551 11-Feb-2006 rwatson

Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after: 3 days


# 154934 27-Jan-2006 jhb

Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored. Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string, whereas
WITNESS_CHECK() evaluates to 0. I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK().
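
A hedged sketch of a caller acting on the result (the fallback policy is
hypothetical):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/malloc.h>

    static int
    hypo_alloc_flags(int flags)
    {
        /*
         * In a WITNESS-less kernel WITNESS_CHECK() evaluates to 0,
         * so this hypothetical fallback compiles away cleanly.
         */
        if (WITNESS_CHECK(WARN_GIANTOK | WARN_SLEEPOK, NULL,
            "allocating with M_WAITOK while holding locks") != 0)
            flags = (flags & ~M_WAITOK) | M_NOWAIT;
        return (flags);
    }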


# 154076 06-Jan-2006 jhb

Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).


# 151526 20-Oct-2005 rwatson

Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by: pjd


# 151516 20-Oct-2005 rwatson

Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones. Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after: 2 weeks


# 151104 08-Oct-2005 des

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


# 149900 09-Sep-2005 alc

Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free. It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock. See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days


# 148371 24-Jul-2005 rwatson

Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after: 1 day


# 148194 20-Jul-2005 rwatson

Further UMA statistics related changes:

- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to
to update the zone's uz_frees statistic. Previously, the statistic was
updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
the caller is freeing an allocated item, to be differentiated from
situations where uma_zfree_internal() is used to tear down the item
during slab teardown in order to invoke its fini() method. Also use
the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
freeing an item, flush cache statistics back to the zone (since the
zone lock and critical section are both held) to match the allocation
case.

MFC after: 3 days


# 148079 16-Jul-2005 rwatson

Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics. UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by: yongari
MFC after: 1 week


# 148078 16-Jul-2005 rwatson

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 148077 16-Jul-2005 rwatson

Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation. So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by: yongari
MFC after: 1 week


# 148072 16-Jul-2005 silby

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 148070 15-Jul-2005 rwatson

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 147996 14-Jul-2005 rwatson

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week


# 147995 14-Jul-2005 rwatson

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# 147994 14-Jul-2005 rwatson

In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU. Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this. Previously this case wasn't
accounted for in user-visible statistics.

MFC after: 1 week


# 145686 29-Apr-2005 rwatson

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others


# 142368 24-Feb-2005 alc

Forced commit: The previous revision's message should have referred to
revision 1.115, not revision 1.114.


# 142367 24-Feb-2005 alc

Revert the first part of revision 1.114 and modify the second part. On
architectures implementing uma_small_alloc() pages do not necessarily
belong to the kmem object.


# 141991 16-Feb-2005 bmilekic

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 141983 16-Feb-2005 bmilekic

Make UMA set the overloaded page->object back to kmem_object for
UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly
came from kmem_map for those two. Previously it would set it back
to NULL for UMA_ZONE_REFCNT zones and although this was probably not
fatal, it added MORE code for no reason.


# 140031 11-Jan-2005 bmilekic

While we want the recursion protection for the bucket zones so that
recursion from the VM is handled (and the calling code that allocates
buckets knows how to deal with it), we do not want to prevent allocation
from the slab header zones (slabzone and slabrefzone) if uk_recurse is
not zero for them. The reason is that it could lead to NULL being
returned for the slab header allocations even in the M_WAITOK
case, and the caller can't handle that (this is also explained in a
comment with this commit).

The problem analysis is documented in our mailing lists:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current

(see entire thread for proper context).

Crash dump data provided by: Peter Holm <peter@holm.cc>


# 139996 10-Jan-2005 stefanf

ISO C requires at least one element in an initialiser list.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139318 25-Dec-2004 bmilekic

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# 137309 06-Nov-2004 rwatson

Abstract the logic to look up the uma_bucket_zone given a desired
number of entries into bucket_zone_lookup(), which helps make more
clear the logic of consumers of bucket zones.

Annotate the behavior of bucket_init() with a comment indicating
how the various data structures, including the bucket lookup tables,
are initialized.


# 137305 06-Nov-2004 rwatson

Annotate what bucket_size[] array does; staticize since it's used only
in uma_core.c.


# 137001 27-Oct-2004 bmilekic

Fix an INVARIANTS-only bug introduced in Revision 1.104:

If INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.


# 136334 09-Oct-2004 green

In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by: ru


# 136276 08-Oct-2004 green

Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used through out. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before. There are additional errors in gross
overestimation of padding, it seems, that would cause large kegs
(nee zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.


# 133230 06-Aug-2004 rwatson

Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.


# 132987 01-Aug-2004 green

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 132842 29-Jul-2004 bmilekic

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed to not go offpage for and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).
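
To make the accounting concrete, a simplified, hedged sketch of the scheme
as of this commit (not the exact kernel computation; hypo_small_init() is
illustrative only, and UMA_FRITM_SZ was later retired by r251709 above):

    #include <vm/uma.h>
    #include <vm/uma_int.h>

    static int
    hypo_small_init(size_t objsize, int flags)
    {
        u_int ipers, wastedspace;

        /* Each item needs objsize bytes plus one freelist index. */
        ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) /
            (objsize + UMA_FRITM_SZ);
        wastedspace = UMA_SLAB_SIZE - sizeof(struct uma_slab) -
            ipers * (objsize + UMA_FRITM_SZ);
        if (wastedspace >= UMA_MAX_WASTE)
            flags |= UMA_ZONE_OFFPAGE;	/* header allocated separately */
        return (flags);
    }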

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.


# 132550 22-Jul-2004 alc

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 132407 19-Jul-2004 green

Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.


# 131573 04-Jul-2004 bmilekic

Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently). This is to be used temporarily
if weird deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock. If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us whether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones. The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.


# 131572 04-Jul-2004 green

Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case. M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.


# 131528 03-Jul-2004 green

Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined. While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS. Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by: Bernd Walter <ticso@cicely5.cicely.de>


# 130995 23-Jun-2004 bmilekic

Make uma_mtx MTX_RECURSE. Here's why:

The general UMA lock is a recursion-allowed lock because
there is a code path where, while we're still configured
to use startup_alloc() for backend page allocations, we
may end up in uma_reclaim() which calls zone_foreach(zone_drain),
which grabs uma_mtx, only to later call into startup_alloc()
because while freeing we needed to allocate a bucket. Since
startup_alloc() also takes uma_mtx, we need to be able to
recurse on it.

This exact explanation also added as comment above mtx_init().

Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>


# 130283 09-Jun-2004 bmilekic

Backout previous change, I think Julian has a better solution which
does not require type-stable refcnts here.


# 130278 09-Jun-2004 bmilekic

Make the slabrefzone, the zone from which we allocated slabs with
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:

    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);

Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.

Spotted by: julian, pjd


# 129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 126793 10-Mar-2004 alc

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# 126714 07-Mar-2004 rwatson

Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.

Reviewed by: jeff


# 125294 01-Feb-2004 jeff

- Fix a problem where we did not drain the cache of buckets in the zone
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.


# 125246 30-Jan-2004 des

Mechanical whitespace cleanup.


# 123126 03-Dec-2003 jhb

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha


# 123073 30-Nov-2003 jeff

- Unbreak UP. mp_maxid is not defined on uni-processor machines, although
I believe it and the other MP variables should be. For now, just define it
here and wait for jhb to clean it up later.

Approved by: re (rwatson)


# 123057 30-Nov-2003 jeff

- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid
was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu
was calculated incorrectly.
- Add some more debugging code so that memory leaks at the time of
uma_zdestroy() are more easily diagnosed.

Approved by: re (rwatson)


# 122680 14-Nov-2003 alc

- Remove use of Giant from uma_zone_set_obj().


# 120311 21-Sep-2003 jeff

- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc
function, startup_alloc(), that is used for single page allocations prior
to the VM starting up. If it is used after the VM starts up, it
replaces the zone's allocf pointer with either page_alloc() or
uma_small_alloc() where appropriate.

Pointy hat to: me
Tested by: phk/amd64, me/x86


# 120305 20-Sep-2003 peter

Bad Jeffr! No cookie!

Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits
break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe
powerpc were not broken.


# 120262 19-Sep-2003 jeff

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 120255 19-Sep-2003 jeff

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# 120249 19-Sep-2003 jeff

- There are an endless stream of style(9) errors in this file. Fix a few.
Also catch some spelling errors.


# 120229 19-Sep-2003 jeff

- Don't inspect the zone in page_alloc(). It may be NULL.
- Don't cache more items than the zone would like in uma_zalloc_bucket().


# 120224 19-Sep-2003 jeff

- Move the logic for dealing with the uma_boot_pages cache into the
page_alloc() function from the slab_zalloc() function. This allows us
to unconditionally call uz_allocf().
- In page_alloc() cleanup the boot_pages logic some. Previously memory from
this cache that was not used by the time the system started was left in
the cache and never used. Typically this wasn't more than a few pages,
but now we will use this cache so long as memory is available.


# 120223 19-Sep-2003 jeff

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# 120221 19-Sep-2003 jeff

- Don't abuse M_DEVBUF, define a tag for UMA hashes.


# 120219 19-Sep-2003 jeff

- Eliminate a pair of unnecessary variables.


# 120218 19-Sep-2003 jeff

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.


# 119182 20-Aug-2003 bmilekic

In sysctl_vm_zone, do not calculate per-cpu cache stats on
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating the garbage cachefree for pcpu caches that
essentially don't exist for at least the 'zones' zone and it so
happened that we were reading from an unmapped location.

Confirmed to fix crash: wilko
Helped debug: wilko, gallatin


# 118795 11-Aug-2003 bmilekic

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
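
In code terms the corrected comparison is roughly the following; field
names are assumed, not quoted from the commit:

    /* Account for the in-page slab header and one linkage byte per item. */
    zone->uz_ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) /
        (zone->uz_size + 1);
    KASSERT(zone->uz_ipers > 0,
        ("zone_small_init: ipers is 0 for zone %s", zone->uz_name));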


# 118380 03-Aug-2003 alc

Revise obj_alloc(). Most notably, use the object's lock to prevent two
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.

In collaboration with: tegge


# 118369 02-Aug-2003 bmilekic

When INVARIANTS is on and we're in uma_zfree_arg(), we need to make
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.
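
Sketched (not the committed diff; 'skip' is the dtor/debug-skip argument
described above):

    #ifdef INVARIANTS
            /* uma_zfree_internal() will skip its own check, so do it here. */
            if (skip)
                    uma_dbg_free(zone, NULL, item);
    #endif
            uma_zfree_internal(zone, item, udata, skip);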


# 118315 01-Aug-2003 bmilekic

Only free the pcpu cache buckets if they are non-NULL.

Crashed this person's machine: harti
Pointy-hat to: me


# 118221 30-Jul-2003 bmilekic

Plug a race and a leak in UMA.

1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. In the window between unlocking
the zone (after setting the working set size to 0) and re-acquiring the
zone lock in zone_drain, the uma timer routine could have fired off and
changed the working set size to something non-zero, thereby potentially
preventing us from completely freeing slabs before destroying the zone
(and thus leaking them).

2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.

While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from an inline into a real function. It's too big to be an inline.

Reviewed by: JeffR


# 118212 30-Jul-2003 bmilekic

When generating the zone stats make sure to handle the master zone
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.

Tested by: Lukas Ertl <le@univie.ac.at>


# 118201 30-Jul-2003 phk

Remove the disabling of buckets workaround.

Thanks to: jeffr


# 118190 30-Jul-2003 jeff

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# 118189 30-Jul-2003 jeff

- Check to see if we need a slab prior to allocating one. Failure to do
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we won't find it later in zone_drain().

Continuously prodded to fix by: phk (Thanks)


# 118187 29-Jul-2003 phk

Temporary workaround: always disable buckets; there is a bug there
somewhere.

JeffR will look at this as soon as he has time.

OK'ed by: jeffr


# 118104 28-Jul-2003 alc

None of the "alloc" functions used by UMA assume that Giant is held any
longer. (If they still need it, e.g., contigmalloc(), they acquire it
themselves.) Therefore, we need not acquire Giant in slab_zalloc().


# 118040 26-Jul-2003 alc

Gulp ... call kmem_malloc() without Giant.


# 117736 18-Jul-2003 harti

When INVARIANTS is defined make sure that uma_zalloc_arg (and hence
uma_zalloc) is called with exactly one of M_WAITOK or M_NOWAIT, and that
it is called with neither M_TRYWAIT nor M_DONTWAIT. Print a warning
if anything is wrong. Default to M_WAITOK if no flag is given. This is the
same test as in malloc(9).
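
A sketch of the check, mirroring the one in malloc(9); not the committed
code:

    #ifdef INVARIANTS
            int indx = flags & (M_WAITOK | M_NOWAIT);

            if (indx == 0)
                    flags |= M_WAITOK;      /* default, as with malloc(9) */
            else if (indx == (M_WAITOK | M_NOWAIT))
                    printf("Bad uma_zalloc flags: %x\n", flags);
    #endif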


# 116837 25-Jun-2003 bmilekic

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# 116829 25-Jun-2003 bmilekic

Make sure that the zone destructor doesn't get called twice in
certain free paths.


# 116226 11-Jun-2003 obrien

Use __FBSDID().


# 116131 09-Jun-2003 phk

Revert last commit; I have no idea what happened.


# 116117 09-Jun-2003 phk

A white-space nit I noticed.


# 114149 28-Apr-2003 alc

uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller
provides storage for the vm_object.


# 114052 26-Apr-2003 alc

Remove an XXX comment. It is no longer a problem.


# 113699 18-Apr-2003 alc

Lock the vm_object in obj_alloc().


# 113665 18-Apr-2003 gallatin

Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This
should allow the use of INTR_MPSAFE network drivers.

Tested by: njl
Glanced at by: jeff


# 112683 26-Mar-2003 tegge

Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling
kmem_free if Giant isn't already held.


# 111883 04-Mar-2003 jhb

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110313 04-Feb-2003 phk

Change a printf to also tell how many items were left in the zone.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109548 19-Jan-2003 jeff

- M_WAITOK is 0 and not a real flag. Test for this properly.

Submitted by: tmm
Pointy hat to: jeff


# 108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 107048 18-Nov-2002 jeff

- Wakeup the correct address when a zone is no longer full.

Spotted by: jake


# 106992 16-Nov-2002 jeff

- Don't forget the flags value when using boot pages.

Reported by: grehan


# 106773 11-Nov-2002 mjacob

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# 106277 31-Oct-2002 jeff

- Add support for machine-dependent page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake
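
The MD hooks look roughly like this; the prototypes are hedged from
memory rather than quoted from the commit:

    #ifdef UMA_MD_SMALL_ALLOC
    /* MD code supplies page-sized allocations, e.g. via a direct map. */
    void *uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags,
        int wait);
    void uma_small_free(void *mem, int size, u_int8_t flags);
    #endif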


# 105853 24-Oct-2002 jeff

- Now that uma_zalloc_internal is not the fast path, don't be so fussy about
extra function calls. Refactor uma_zalloc_internal into separate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.

Tested on: alpha, x86


# 105848 24-Oct-2002 jeff

- Move the destructor calls so that they are not called with the zone lock
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.

Reported by: phk


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 103623 19-Sep-2002 jeff

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 103531 18-Sep-2002 jeff

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.


# 102241 21-Aug-2002 archie

Don't use "NULL" when "0" is really meant.


# 99472 05-Jul-2002 jeff

Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across
calls to zone_drain().

Noticed by: scottl


# 99424 05-Jul-2002 jeff

Remove unnecessary includes.


# 99320 02-Jul-2002 jeff

Actually use the fini callback.

Pointy hat to: me :-(
Noticed By: Julian


# 98822 25-Jun-2002 jeff

Reduce the amount of code that runs with the zone lock held in slab_zalloc().
This allows us to run the zone initialization functions without any locks held.


# 98451 19-Jun-2002 jeff

- Remove bogus use of kmem_alloc that was inherited from the old zone
allocator.
- Properly set M_ZERO when talking to the back end page allocators for
non malloc zones. This forces us to zero fill pages when they are first
brought into a cache.
- Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where
per cpu buckets weren't always getting zeroed.


# 98362 17-Jun-2002 jeff

Honor the BUCKETCACHE flag on free as well.


# 98361 17-Jun-2002 jeff

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them, but
otherwise we'll skip the allocation.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
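
Sketch of the bucket-refill path with the new flag; the flag and zone
names are assumptions, not the committed identifiers:

    int bflags;

    /* BUCKETCACHE zones must never send bucket fills to the VM. */
    bflags = M_NOWAIT;
    if (zone->uz_flags & UMA_ZFLAG_BUCKETCACHE)
            bflags |= M_NOVM;
    bucket = uma_zalloc_internal(bucketzone, NULL, bflags);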


# 98075 10-Jun-2002 iedowse

Correct the logic for determining whether the per-CPU locks need
to be destroyed. This fixes a problem where destroying a UMA zone
would fail to destroy all zone mutexes.

Reviewed by: jeff


# 97787 03-Jun-2002 jeff

Add a comment describing a resource leak that occurs during a failure case
in obj_alloc.


# 97007 20-May-2002 jhb

In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure
that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls
uma we can probably remove the check in malloc() for this now. Also,
perform an extra witness check in that case to make sure we don't hold
any locks when performing a M_WAITOK allocation.
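
A sketch of the added checks, written with the WITNESS_WARN spelling that
replaced WITNESS_SLEEP later (r111883); the assertion text is
illustrative:

    if (!(flags & M_NOWAIT)) {
            KASSERT(curthread->td_intr_nesting_level == 0,
                ("uma_zalloc: M_WAITOK allocation in interrupt context"));
            WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
                "uma_zalloc_arg: zone \"%s\"", zone->uz_name);
    }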


# 96496 13-May-2002 jeff

Don't call the uz free function while the zone lock is held. This can lead
to lock order reversals. uma_reclaim now builds a list of freeable slabs and
then unlocks the zones to do all of the frees.
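
The two-phase drain, sketched; the list and field names are assumed:

    LIST_HEAD(, uma_slab) freeslabs = LIST_HEAD_INITIALIZER(freeslabs);
    uma_slab_t slab;

    ZONE_LOCK(zone);
    /* Phase 1: unhook free slabs while the zone lock is held. */
    while ((slab = LIST_FIRST(&zone->uz_free_slab)) != NULL) {
            LIST_REMOVE(slab, us_link);
            LIST_INSERT_HEAD(&freeslabs, slab, us_link);
    }
    ZONE_UNLOCK(zone);
    /* Phase 2: call the backend free function with no locks held. */
    while ((slab = LIST_FIRST(&freeslabs)) != NULL) {
            LIST_REMOVE(slab, us_link);
            zone->uz_freef(slab->us_data, UMA_SLAB_SIZE, slab->us_flags);
    }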


# 96493 13-May-2002 jeff

Remove the hash_free() lock order reversal. This could have happened for
several reasons before. Fixing it involved restructuring the generic hash
code to require calling code to handle locking, unlocking, and freeing hashes
on error conditions.


# 96044 04-May-2002 jeff

Use pages instead of uz_maxpages, which has not been initialized yet, when
creating the vm_object. This was broken after the code was rearranged to
grab Giant itself.

Spotted by: alc


# 95930 02-May-2002 jeff

Move around the dbg code a bit so it's always under a lock. This stops a
weird potential race if we were preempted right as we were doing the dbg
checks.


# 95925 02-May-2002 arr

- Changed the size element of uma_zctor_args to be size_t instead of int.
- Changed uma_zcreate to accept the size argument as a size_t instead of
int.

Approved by: jeff


# 95923 02-May-2002 jeff

malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.


# 95899 02-May-2002 jeff

Remove the temporary alignment check in free().

Implement the following checks on freed memory in the bucket path:
- Slab membership
- Alignment
- Duplicate free

This previously was only done if we skipped the buckets. This code will slow
down INVARIANTS a bit, but it is SMP safe. The checks were moved out of the
normal path and into hooks supplied in uma_dbg.


# 95766 30-Apr-2002 jeff

Move the implementation of M_ZERO into UMA so that it can be passed to
uma_zalloc and friends. Remove this functionality from the malloc wrapper.

Document this change in uma.h and adjust variable names in uma_core.
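
In effect (a sketch; the zeroing happens once an item is in hand):

    /* Zero on the caller's behalf instead of in the malloc wrapper. */
    if (flags & M_ZERO)
            bzero(item, zone->uz_size);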


# 95758 29-Apr-2002 jeff

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# 95432 25-Apr-2002 arr

- Fix a round down bogon in uma_zone_set_max().

Submitted by: jeff@


# 94653 14-Apr-2002 jeff

Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone. Fix this by doing the allocation
separately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do. We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.
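
The shape of the fix, sketched with assumed helper and field names:

    struct uma_hash newhash;
    int ret;

    ZONE_UNLOCK(zone);
    ret = hash_alloc(&newhash);     /* may sleep; no zone lock held */
    ZONE_LOCK(zone);
    if (ret && newhash.uh_hashsize > zone->uz_hash.uh_hashsize)
            hash_expand(&zone->uz_hash, &newhash);  /* we won the race */
    else
            hash_free(&newhash);                    /* lost it; discard */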


# 94651 14-Apr-2002 jeff

Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.


# 94631 13-Apr-2002 jeff

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT, though, so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
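
The blocking path, roughly; the field and wmesg names are assumed:

    while (zone->uz_pages >= zone->uz_maxpages) {
            if (flags & M_NOWAIT) {
                    ZONE_UNLOCK(zone);
                    return (NULL);          /* fail, as before */
            }
            /* M_WAITOK: sleep until a free brings us under the limit. */
            zone->uz_flags |= UMA_ZFLAG_FULL;
            msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
    }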


# 94329 09-Apr-2002 jeff

Remember to unlock the zone if the fill count is too high.

Pointed out by: pete, jake, jhb


# 94165 08-Apr-2002 jeff

Add a mechanism to disable buckets when the v_free_count drops below
v_free_min. This should help performance in memory starved situations.


# 94163 08-Apr-2002 jeff

Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with Giant.

Pointy hat to: jeff
Noticed by: jake


# 94161 08-Apr-2002 jeff

Implement uma_zdestroy(). Its prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a separate hash_free() for use
in uma_zdestroy.


# 94159 08-Apr-2002 jeff

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# 94155 07-Apr-2002 jeff

This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extreme low-memory situations occurred. This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by: kris


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
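
For a UMA zone the call now looks roughly like this (field names assumed):

    /* Per-zone name plus a shared "UMA zone" type for witness. */
    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);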


# 93697 02-Apr-2002 alfred

fix comment typo, s/neccisary/necessary/g


# 93089 24-Mar-2002 jeff

Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler. This hadn't caused
problems yet because Giant is held all the time.

Reported by: kkenn


# 92758 20-Mar-2002 jeff

Add uma_zone_set_max() to add enforced limits to zones not backed by a VM object.
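
Hypothetical usage (the zone name and struct foo are illustrative):

    zone = uma_zcreate("foo", sizeof(struct foo), NULL, NULL, NULL, NULL,
        UMA_ALIGN_PTR, 0);
    uma_zone_set_max(zone, 4096);   /* enforce a 4096-item ceiling */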


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@