History log of /freebsd-10.0-release/sys/vm/uma_core.c
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 258911 04-Dec-2013 rodrigc

MFC r258737

In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian
Approved by: re (gjb)


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255097 31-Aug-2013 mckusick

Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by: pho


# 254182 10-Aug-2013 kib

Different consumers of the struct vm_page abuse the pageq member to keep
additional information, when the page is guaranteed not to belong to a
paging queue. Usually, this results in a lot of type casts, which make
reasoning about the code's correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 253565 23-Jul-2013 glebius

Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU. This
allows us to init the counter zone at an early stage of boot.

Reviewed by: kib
Tested by: Lytochkin Boris <lytboris gmail.com>


# 252358 28-Jun-2013 davide

Remove a spurious keg lock acquisition.


# 252226 25-Jun-2013 jeff

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. Buckets will never be filled
from reserve allocations. A usage sketch follows below.
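
To make the reserve mechanism concrete, here is a minimal hedged sketch
(the zone, item type and reserve count are hypothetical; uma_zone_reserve()
and M_USE_RESERVE are the interfaces from this commit):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    struct crit_item {			/* hypothetical item type */
        int ci_dummy;
    };

    static uma_zone_t crit_zone;	/* hypothetical zone */

    static void
    crit_zone_init(void)
    {
        crit_zone = uma_zcreate("crit", sizeof(struct crit_item),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
        /* Hold 32 items aside; buckets are never filled from these. */
        uma_zone_reserve(crit_zone, 32);
    }

    /* In a context that must not fail even under memory pressure: */
    static struct crit_item *
    crit_item_alloc(void)
    {
        return (uma_zalloc(crit_zone, M_NOWAIT | M_USE_RESERVE));
    }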

Sponsored by: EMC / Isilon Storage Division


# 252040 20-Jun-2013 jeff

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# 251983 19-Jun-2013 jeff

- Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by: EMC / Isilon Storage Division


# 251894 18-Jun-2013 jeff

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock; this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree; this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 251826 17-Jun-2013 jeff

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.
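
As a hedged sketch of the import/release contract (the backend helpers,
names and item size are hypothetical, and the prototype shown includes the
size argument that r252040 above added later):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /* Hypothetical backing store feeding a cache-only zone. */
    static int
    hypo_import(void *arg, void **store, int count, int flags)
    {
        int i;

        for (i = 0; i < count; i++)
            if ((store[i] = hypo_backend_alloc(arg, flags)) == NULL)
                break;
        return (i);		/* number of items actually imported */
    }

    static void
    hypo_release(void *arg, void **store, int count)
    {
        int i;

        for (i = 0; i < count; i++)
            hypo_backend_free(arg, store[i]);
    }

    static uma_zone_t hypo_cache;

    static void
    hypo_cache_init(void)
    {
        hypo_cache = uma_zcache_create("hypocache", HYPO_ITEM_SIZE,
            NULL, NULL, NULL, NULL, hypo_import, hypo_release,
            NULL, 0);
    }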

Sponsored by: EMC / Isilon Storage Division


# 251709 13-Jun-2013 jeff

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 249763 22-Apr-2013 glebius

Panic if a UMA_ZONE_PCPU zone is created at an early stage of boot, when
mp_ncpus isn't yet initialized. Otherwise we will panic at the first
allocation later.

Sponsored by: Nginx, Inc.


# 249313 09-Apr-2013 glebius

Convert UMA code to C99 uintXX_t types.


# 249305 09-Apr-2013 glebius

Fix KASSERTs: maximum number of items per slab is 256.


# 249264 08-Apr-2013 glebius

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
a zone would have a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance would allow us to make some optimizations later.

To address a private item from a CPU, simple arithmetic should be used:

item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
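
As an illustration, a minimal hedged sketch of a consumer (the zone and
counter names are hypothetical; UMA_ZONE_PCPU and zpcpu_get() are the
interfaces described above):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/pcpu.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static uma_zone_t pcpu_zone;	/* hypothetical zone */
    static uint64_t *pktcnt;		/* base address; one twin per CPU */

    static void
    pktcnt_init(void)
    {
        pcpu_zone = uma_zcreate("pcpu-u64", sizeof(uint64_t),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
        pktcnt = uma_zalloc(pcpu_zone, M_WAITOK);
    }

    static void
    pktcnt_inc(void)
    {
        critical_enter();		/* pin to the current CPU */
        (*zpcpu_get(pktcnt))++;		/* the arithmetic shown above */
        critical_exit();
    }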

To introduce non-page-size slabs, a new field, uk_slabsize, has been added
to uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, the uma_keg
fields were rearranged a bit, and the least frequently used uk_name and
uk_link were moved down to the fourth cache line. All other frequently
dereferenced fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The avoidance of namespace pollution in vm/vm_pager.h (which forced
consumers to include sys/mutex.h directly to cater to its inline
functions using VM_OBJECT_LOCK()) means that all vm/vm_pager.h
consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks in
the compat layer because the name clash between the FreeBSD and solaris
versions must be avoided.
For this purpose, zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 247360 26-Feb-2013 attilio

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater to the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater to the backend
allocator.
- uma_zone_reserve_kva() can cater to M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() directly uses the direct mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().
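
Stepping back to the new primitive itself, a hedged usage sketch (the
zone, item count and failure handling are hypothetical):

    #include <sys/systm.h>
    #include <vm/uma.h>

    static void
    hypo_zone_reserve(uma_zone_t hypo_zone)
    {
        /* Reserve KVA up front so 4096 items can always be mapped. */
        if (uma_zone_reserve_kva(hypo_zone, 4096) == 0)
            panic("hypo: cannot reserve KVA");
        uma_prealloc(hypo_zone, 4096);	/* optionally prefault pages */
    }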

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide


# 246087 29-Jan-2013 glebius

Fix typo in debug printf.


# 243998 07-Dec-2012 pjd

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
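
For illustration, a minimal hedged sketch of a consumer (the zone, its
size and limit are hypothetical):

    #include <sys/param.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static void
    hypo_zone_init(void)
    {
        hypo_zone = uma_zcreate("hypo", 128, NULL, NULL, NULL, NULL,
            UMA_ALIGN_PTR, 0);
        uma_zone_set_max(hypo_zone, 1024);
        /* Printed at most once every five minutes once the zone is full: */
        uma_zone_set_warning(hypo_zone, "hypo zone limit reached");
    }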

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 242152 26-Oct-2012 mdf

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 241825 22-Oct-2012 eadler

Print flags as hex instead of an integer.

PR: kern/168210
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 240676 18-Sep-2012 glebius

If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by: jeff


# 239710 26-Aug-2012 glebius

Fix function name in keg_cachespread_init() assert.


# 238206 07-Jul-2012 eadler

Add missing sleep stat increase

PR: kern/168211
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 238000 02-Jul-2012 jhb

Honor db_pager_quit in 'show uma' and 'show malloc'.

MFC after: 1 month


# 235854 23-May-2012 emax

Tweak condition for disabling allocation from per-CPU buckets in
low memory situation. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.

Reviewed by: alc
MFC after: 1 week


# 230623 27-Jan-2012 kmacy

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA ones, now entails the addition
of a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


# 226313 12-Oct-2011 glebius

Make memguard(9) capable of guarding uma(9) allocations.


# 222184 22-May-2011 alc

Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by: jh


# 222163 21-May-2011 alc

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# 222132 20-May-2011 alc

Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)


# 219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 217916 26-Jan-2011 mdf

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# 214782 04-Nov-2010 jhb

Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by: trema
Reviewed by: attilio
MFC after: 1 week


# 214062 19-Oct-2010 mdf

uma_zfree(zone, NULL) should do nothing, to match free(9).

Noticed by: Ron Steinke <rsteinke at isilon dot com>
MFC after: 3 days


# 213911 16-Oct-2010 lstewart

Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.
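
A hedged sketch of the resulting single-call idiom (the zone, limit and
message are hypothetical):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static void
    hypo_set_limit(uma_zone_t zone)
    {
        int eff;

        eff = uma_zone_set_max(zone, 1000);	/* returns effective limit */
        if (eff != 1000)
            printf("zone limit rounded up to %d items\n", eff);
    }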

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb


# 213910 16-Oct-2010 lstewart

- Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.
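
A hedged sketch of the sysctl use case mentioned above (the handler and
zone names are hypothetical):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static int
    sysctl_hypo_inuse(SYSCTL_HANDLER_ARGS)
    {
        int cur;

        cur = uma_zone_get_cur(hypo_zone);	/* approximate occupancy */
        return (sysctl_handle_int(oidp, &cur, 0, req));
    }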

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks


# 212750 16-Sep-2010 mdf

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 212572 13-Sep-2010 mdf

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# 212370 09-Sep-2010 mdf

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# 211396 16-Aug-2010 andre

Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally). This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value. The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.
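
A hedged sketch of the two-call idiom this commit enables (the helper
name is hypothetical):

    #include <vm/uma.h>

    static int
    hypo_apply_limit(uma_zone_t zone, int limit)
    {

        uma_zone_set_max(zone, limit);		/* rounded up internally */
        return (uma_zone_get_max(zone));	/* the effective limit */
    }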

Reviewed by: jeffr
MFC after: 1 week


# 209215 15-Jun-2010 sbruno

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().


# 207576 03-May-2010 alc

It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages. Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by: kib


# 207410 29-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 201145 28-Dec-2009 antoine

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
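
A minimal sketch of the correct usage (struct foo is hypothetical):

    #include <sys/queue.h>

    struct foo {
        SLIST_ENTRY(foo) f_link;
    };

    /*
     * Correct: the argument is the head variable itself, not the type
     * (the expansion ignores it, hence no change to generated binaries).
     */
    static SLIST_HEAD(foohead, foo) foo_head =
        SLIST_HEAD_INITIALIZER(foo_head);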

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


# 194429 18-Jun-2009 alc

Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)


# 187681 25-Jan-2009 jeff

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# 182047 23-Aug-2008 antoine

Remove unused variable nosleepwithlocks.

PR: 126609
Submitted by: Mateusz Guzik
MFC after: 1 month
X-MFC: to stable/7 only, this variable is still used in stable/6


# 182028 22-Aug-2008 nwhitehorn

Allow the MD UMA allocator to use VM routines like kmem_*(). Existing
code requires MD allocator to be available early in the boot process,
before the VM is fully available. This defines a new VM define
(UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to
become available at the same time as the default UMA allocator.

Approved by: marcel (mentor)


# 177921 04-Apr-2008 alc

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


# 172545 11-Oct-2007 jhb

Allow recursion on the 'zones' internal UMA zone.

Submitted by: thompsa
MFC after: 1 week
Approved by: re (kensmith)
Discussed with: jeff


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 166654 11-Feb-2007 rwatson

Add uma_set_align() interface, which will be called at most once during
boot by MD code to indicate a detected alignment preference. Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment. If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired. This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.
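
A hedged sketch of what such MD startup code might look like (the hook
name and cache line size are hypothetical):

    #include <vm/uma.h>

    /* Must run before UMA is initialized so all zones pick it up. */
    static void
    hypo_md_set_align(void)
    {
        uma_set_align(64 - 1);		/* e.g. 64-byte cache lines */
    }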

Reviewed by: jeff, alc
Submitted by: attilio


# 166213 24-Jan-2007 mohans

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


# 166211 24-Jan-2007 mohans

Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wake up all such blocked
processes instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewed by: ups@


# 165928 10-Jan-2007 rwatson

Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call. This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after: 3 weeks
Discussed with: jhb


# 165809 05-Jan-2007 jhb

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.
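
A hedged sketch of the kind of check described (the zone and message are
hypothetical):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <vm/uma.h>

    static uma_zone_t hypo_zone;	/* hypothetical zone */

    static void
    hypo_zone_check(void)
    {
        if (uma_zone_exhausted(hypo_zone))
            printf("WARNING: hypo zone exhausted, "
                "allocations may block or fail\n");
    }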

MFC after: 1 week
Reviewed by: alc


# 163702 26-Oct-2006 rwatson

Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after: 3 days


# 160460 17-Jul-2006 rwatson

Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>


# 160414 16-Jul-2006 rwatson

Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.


# 158803 21-May-2006 rwatson

When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type. The failure
will be separately recorded with the bucket type. This may eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after: 2 weeks
Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>,
OxY <oxy at field dot hu>,
Gabor MICSKO <gmicskoa at szintezis dot hu>


# 155551 11-Feb-2006 rwatson

Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after: 3 days


# 154934 27-Jan-2006 jhb

Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored. Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string, whereas
WITNESS_CHECK() evaluates to 0. I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK().
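
A hedged sketch of a caller acting on the result (the fallback policy is
hypothetical):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/malloc.h>

    static int
    hypo_alloc_flags(int flags)
    {
        /*
         * In a WITNESS-less kernel WITNESS_CHECK() evaluates to 0,
         * so this hypothetical fallback compiles away cleanly.
         */
        if (WITNESS_CHECK(WARN_GIANTOK | WARN_SLEEPOK, NULL,
            "allocating with M_WAITOK while holding locks") != 0)
            flags = (flags & ~M_WAITOK) | M_NOWAIT;
        return (flags);
    }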


# 154076 06-Jan-2006 jhb

Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).


# 151526 20-Oct-2005 rwatson

Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by: pjd


# 151516 20-Oct-2005 rwatson

Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones. Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after: 2 weeks


# 151104 08-Oct-2005 des

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


# 149900 09-Sep-2005 alc

Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free. It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock. See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days


# 148371 24-Jul-2005 rwatson

Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after: 1 day


# 148194 20-Jul-2005 rwatson

Further UMA statistics related changes:

- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to
to update the zone's uz_frees statistic. Previously, the statistic was
updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
the caller is freeing an allocated item, to be differentiated from
situations where uma_zfree_internal() is used to tear down the item
during slab teardown in order to invoke its fini() method. Also use
the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
freeing an item, flush cache statistics back to the zone (since the
zone lock and critical section are both held) to match the allocation
case.

MFC after: 3 days


# 148079 16-Jul-2005 rwatson

Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics. UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by: yongari
MFC after: 1 week


# 148078 16-Jul-2005 rwatson

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 148077 16-Jul-2005 rwatson

Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation. So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by: yongari
MFC after: 1 week


# 148072 16-Jul-2005 silby

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 148070 15-Jul-2005 rwatson

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 147996 14-Jul-2005 rwatson

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week


# 147995 14-Jul-2005 rwatson

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# 147994 14-Jul-2005 rwatson

In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU. Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this. Previously this case wasn't
accounted for in user-visible statistics.

MFC after: 1 week


# 145686 29-Apr-2005 rwatson

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others


# 142368 24-Feb-2005 alc

Forced commit: The previous revision's message should have referred to
revision 1.115, not revision 1.114.


# 142367 24-Feb-2005 alc

Revert the first part of revision 1.114 and modify the second part. On
architectures implementing uma_small_alloc() pages do not necessarily
belong to the kmem object.


# 141991 16-Feb-2005 bmilekic

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 141983 16-Feb-2005 bmilekic

Make UMA set the overloaded page->object back to kmem_object for
UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly
came from kmem_map for those two. Previously it would set it back
to NULL for UMA_ZONE_REFCNT zones and although this was probably not
fatal, it added MORE code for no reason.


# 140031 11-Jan-2005 bmilekic

While we want the recursion protection for the bucket zones so that
recursion from the VM is handled (and the calling code that allocates
buckets knows how to deal with it), we do not want to prevent allocation
from the slab header zones (slabzone and slabrefzone) if uk_recurse is
not zero for them. The reason is that it could lead to NULL being
returned for the slab header allocations even in the M_WAITOK
case, and the caller can't handle that (this is also explained in a
comment with this commit).

The problem analysis is documented in our mailing lists:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current

(see entire thread for proper context).

Crash dump data provided by: Peter Holm <peter@holm.cc>


# 139996 10-Jan-2005 stefanf

ISO C requires at least one element in an initialiser list.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139318 25-Dec-2004 bmilekic

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# 137309 06-Nov-2004 rwatson

Abstract the logic to look up the uma_bucket_zone given a desired
number of entries into bucket_zone_lookup(), which helps make more
clear the logic of consumers of bucket zones.

Annotate the behavior of bucket_init() with a comment indicating
how the various data structures, including the bucket lookup tables,
are initialized.


# 137305 06-Nov-2004 rwatson

Annotate what bucket_size[] array does; staticize since it's used only
in uma_core.c.


# 137001 27-Oct-2004 bmilekic

Fix an INVARIANTS-only bug introduced in Revision 1.104:

If INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.


# 136334 09-Oct-2004 green

In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by: ru


# 136276 08-Oct-2004 green

Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used through out. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before. There are additional errors in gross
overestimation of padding, it seems, that would cause large kegs
(nee zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.


# 133230 06-Aug-2004 rwatson

Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.


# 132987 01-Aug-2004 green

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 132842 29-Jul-2004 bmilekic

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed to not go offpage for and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).
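
To make the accounting concrete, a simplified, hedged sketch of the scheme
as of this commit (not the exact kernel computation; hypo_small_init() is
illustrative only, and UMA_FRITM_SZ was later retired by r251709 above):

    #include <vm/uma.h>
    #include <vm/uma_int.h>

    static int
    hypo_small_init(size_t objsize, int flags)
    {
        u_int ipers, wastedspace;

        /* Each item needs objsize bytes plus one freelist index. */
        ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) /
            (objsize + UMA_FRITM_SZ);
        wastedspace = UMA_SLAB_SIZE - sizeof(struct uma_slab) -
            ipers * (objsize + UMA_FRITM_SZ);
        if (wastedspace >= UMA_MAX_WASTE)
            flags |= UMA_ZONE_OFFPAGE;	/* header allocated separately */
        return (flags);
    }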

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.


# 132550 22-Jul-2004 alc

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 132407 19-Jul-2004 green

Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.


# 131573 04-Jul-2004 bmilekic

Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently). This is to be used temporarily
if weird deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock. If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us whether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones. The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.


# 131572 04-Jul-2004 green

Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case. M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.


# 131528 03-Jul-2004 green

Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined. While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS. Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by: Bernd Walter <ticso@cicely5.cicely.de>


# 130995 23-Jun-2004 bmilekic

Make uma_mtx MTX_RECURSE. Here's why:

The general UMA lock is a recursion-allowed lock because
there is a code path where, while we're still configured
to use startup_alloc() for backend page allocations, we
may end up in uma_reclaim() which calls zone_foreach(zone_drain),
which grabs uma_mtx, only to later call into startup_alloc()
because while freeing we needed to allocate a bucket. Since
startup_alloc() also takes uma_mtx, we need to be able to
recurse on it.

This exact explanation also added as comment above mtx_init().

Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>


# 130283 09-Jun-2004 bmilekic

Backout previous change, I think Julian has a better solution which
does not require type-stable refcnts here.


# 130278 09-Jun-2004 bmilekic

Make the slabrefzone, the zone from which we allocated slabs with
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:

    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);

Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.

Spotted by: julian, pjd


# 129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 126793 10-Mar-2004 alc

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# 126714 07-Mar-2004 rwatson

Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.

Reviewed by: jeff


# 125294 01-Feb-2004 jeff

- Fix a problem where we did not drain the cache of buckets in the zone
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.


# 125246 30-Jan-2004 des

Mechanical whitespace cleanup.


# 123126 03-Dec-2003 jhb

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha


# 123073 30-Nov-2003 jeff

- Unbreak UP. mp_maxid is not defined on uni-processor machines, although
I believe it and the other MP variables should be. For now, just define it
here and wait for jhb to clean it up later.

Approved by: re (rwatson)


# 123057 30-Nov-2003 jeff

- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid
was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu
was calculated incorrectly.
- Add some more debugging code so that memory leaks at the time of
uma_zdestroy() are more easily diagnosed.

Approved by: re (rwatson)


# 122680 14-Nov-2003 alc

- Remove use of Giant from uma_zone_set_obj().


# 120311 21-Sep-2003 jeff

- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc
function, startup_alloc(), that is used for single page allocations prior
to the VM starting up. If it is used after the VM starts up, it
replaces the zone's allocf pointer with either page_alloc() or
uma_small_alloc() where appropriate.

Pointy hat to: me
Tested by: phk/amd64, me/x86


# 120305 20-Sep-2003 peter

Bad Jeffr! No cookie!

Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits
break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe
powerpc were not broken.


# 120262 19-Sep-2003 jeff

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 120255 19-Sep-2003 jeff

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# 120249 19-Sep-2003 jeff

- There are an endless stream of style(9) errors in this file. Fix a few.
Also catch some spelling errors.


# 120229 19-Sep-2003 jeff

- Don't inspect the zone in page_alloc(). It may be NULL.
- Don't cache more items than the zone would like in uma_zalloc_bucket().


# 120224 19-Sep-2003 jeff

- Move the logic for dealing with the uma_boot_pages cache into the
page_alloc() function from the slab_zalloc() function. This allows us
to unconditionally call uz_allocf().
- In page_alloc() cleanup the boot_pages logic some. Previously memory from
this cache that was not used by the time the system started was left in
the cache and never used. Typically this wasn't more than a few pages,
but now we will use this cache so long as memory is available.


# 120223 19-Sep-2003 jeff

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# 120221 19-Sep-2003 jeff

- Don't abuse M_DEVBUF, define a tag for UMA hashes.


# 120219 19-Sep-2003 jeff

- Eliminate a pair of unnecessary variables.


# 120218 19-Sep-2003 jeff

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.


# 119182 20-Aug-2003 bmilekic

In sysctl_vm_zone, do not calculate per-cpu cache stats on
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating the garbage cachefree for pcpu caches that
essentially don't exist for at least the 'zones' zone and it so
happened that we were reading from an unmapped location.

Confirmed to fix crash: wilko
Helped debug: wilko, gallatin


# 118795 11-Aug-2003 bmilekic

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
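
In code terms the corrected comparison is roughly the following; field
names are assumed, not quoted from the commit:

    /* Account for the in-page slab header and one linkage byte per item. */
    zone->uz_ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) /
        (zone->uz_size + 1);
    KASSERT(zone->uz_ipers > 0,
        ("zone_small_init: ipers is 0 for zone %s", zone->uz_name));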


# 118380 03-Aug-2003 alc

Revise obj_alloc(). Most notably, use the object's lock to prevent two
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.

In collaboration with: tegge


# 118369 02-Aug-2003 bmilekic

When INVARIANTS is on and we're in uma_zfree_arg(), we need to make
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.
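
Sketched (not the committed diff; 'skip' is the dtor/debug-skip argument
described above):

    #ifdef INVARIANTS
            /* uma_zfree_internal() will skip its own check, so do it here. */
            if (skip)
                    uma_dbg_free(zone, NULL, item);
    #endif
            uma_zfree_internal(zone, item, udata, skip);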


# 118315 01-Aug-2003 bmilekic

Only free the pcpu cache buckets if they are non-NULL.

Crashed this person's machine: harti
Pointy-hat to: me


# 118221 30-Jul-2003 bmilekic

Plug a race and a leak in UMA.

1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. In the window between unlocking
the zone (after setting the working set size to 0) and re-acquiring the
zone lock in zone_drain, the uma timer routine could have fired off and
changed the working set size to something non-zero, thereby potentially
preventing us from completely freeing slabs before destroying the zone
(and thus leaking them).

2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.

While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from an inline into a real function. It's too big to be an inline.

Reviewed by: JeffR


# 118212 30-Jul-2003 bmilekic

When generating the zone stats make sure to handle the master zone
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.

Tested by: Lukas Ertl <le@univie.ac.at>


# 118201 30-Jul-2003 phk

Remove the disabling of buckets workaround.

Thanks to: jeffr


# 118190 30-Jul-2003 jeff

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# 118189 30-Jul-2003 jeff

- Check to see if we need a slab prior to allocating one. Failure to do
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we won't find it later in zone_drain().

Continuously prodded to fix by: phk (Thanks)


# 118187 29-Jul-2003 phk

Temporary workaround: always disable buckets; there is a bug there
somewhere.

JeffR will look at this as soon as he has time.

OK'ed by: jeffr


# 118104 28-Jul-2003 alc

None of the "alloc" functions used by UMA assume that Giant is held any
longer. (If they still need it, e.g., contigmalloc(), they acquire it
themselves.) Therefore, we need not acquire Giant in slab_zalloc().


# 118040 26-Jul-2003 alc

Gulp ... call kmem_malloc() without Giant.


# 117736 18-Jul-2003 harti

When INVARIANTS is defined make sure that uma_zalloc_arg (and hence
uma_zalloc) is called with exactly one of M_WAITOK or M_NOWAIT, and that
it is called with neither M_TRYWAIT nor M_DONTWAIT. Print a warning
if anything is wrong. Default to M_WAITOK if no flag is given. This is the
same test as in malloc(9).
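
A sketch of the check, mirroring the one in malloc(9); not the committed
code:

    #ifdef INVARIANTS
            int indx = flags & (M_WAITOK | M_NOWAIT);

            if (indx == 0)
                    flags |= M_WAITOK;      /* default, as with malloc(9) */
            else if (indx == (M_WAITOK | M_NOWAIT))
                    printf("Bad uma_zalloc flags: %x\n", flags);
    #endif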


# 116837 25-Jun-2003 bmilekic

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# 116829 25-Jun-2003 bmilekic

Make sure that the zone destructor doesn't get called twice in
certain free paths.


# 116226 11-Jun-2003 obrien

Use __FBSDID().


# 116131 09-Jun-2003 phk

Revert last commit; I have no idea what happened.


# 116117 09-Jun-2003 phk

A white-space nit I noticed.


# 114149 28-Apr-2003 alc

uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller
provides storage for the vm_object.


# 114052 26-Apr-2003 alc

Remove an XXX comment. It is no longer a problem.


# 113699 18-Apr-2003 alc

Lock the vm_object in obj_alloc().


# 113665 18-Apr-2003 gallatin

Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This
should allow the use of INTR_MPSAFE network drivers.

Tested by: njl
Glanced at by: jeff


# 112683 26-Mar-2003 tegge

Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling
kmem_free if Giant isn't already held.


# 111883 04-Mar-2003 jhb

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110313 04-Feb-2003 phk

Change a printf to also tell how many items were left in the zone.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109548 19-Jan-2003 jeff

- M_WAITOK is 0 and not a real flag. Test for this properly.

Submitted by: tmm
Pointy hat to: jeff


# 108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 107048 18-Nov-2002 jeff

- Wakeup the correct address when a zone is no longer full.

Spotted by: jake


# 106992 16-Nov-2002 jeff

- Don't forget the flags value when using boot pages.

Reported by: grehan


# 106773 11-Nov-2002 mjacob

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# 106277 31-Oct-2002 jeff

- Add support for machine-dependent page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake
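
The MD hooks look roughly like this; the prototypes are hedged from
memory rather than quoted from the commit:

    #ifdef UMA_MD_SMALL_ALLOC
    /* MD code supplies page-sized allocations, e.g. via a direct map. */
    void *uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags,
        int wait);
    void uma_small_free(void *mem, int size, u_int8_t flags);
    #endif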


# 105853 24-Oct-2002 jeff

- Now that uma_zalloc_internal is not the fast path, don't be so fussy about
extra function calls. Refactor uma_zalloc_internal into separate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.

Tested on: alpha, x86


# 105848 24-Oct-2002 jeff

- Move the destructor calls so that they are not called with the zone lock
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.

Reported by: phk


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 103623 19-Sep-2002 jeff

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 103531 18-Sep-2002 jeff

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.


# 102241 21-Aug-2002 archie

Don't use "NULL" when "0" is really meant.


# 99472 05-Jul-2002 jeff

Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across
calls to zone_drain().

Noticed by: scottl


# 99424 05-Jul-2002 jeff

Remove unnecessary includes.


# 99320 02-Jul-2002 jeff

Actually use the fini callback.

Pointy hat to: me :-(
Noticed By: Julian


# 98822 25-Jun-2002 jeff

Reduce the amount of code that runs with the zone lock held in slab_zalloc().
This allows us to run the zone initialization functions without any locks held.


# 98451 19-Jun-2002 jeff

- Remove bogus use of kmem_alloc that was inherited from the old zone
allocator.
- Properly set M_ZERO when talking to the back end page allocators for
non malloc zones. This forces us to zero fill pages when they are first
brought into a cache.
- Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where
per cpu buckets weren't always getting zeroed.


# 98362 17-Jun-2002 jeff

Honor the BUCKETCACHE flag on free as well.


# 98361 17-Jun-2002 jeff

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them, but
otherwise we'll skip the allocation.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
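
Sketch of the bucket-refill path with the new flag; the flag and zone
names are assumptions, not the committed identifiers:

    int bflags;

    /* BUCKETCACHE zones must never send bucket fills to the VM. */
    bflags = M_NOWAIT;
    if (zone->uz_flags & UMA_ZFLAG_BUCKETCACHE)
            bflags |= M_NOVM;
    bucket = uma_zalloc_internal(bucketzone, NULL, bflags);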


# 98075 10-Jun-2002 iedowse

Correct the logic for determining whether the per-CPU locks need
to be destroyed. This fixes a problem where destroying a UMA zone
would fail to destroy all zone mutexes.

Reviewed by: jeff


# 97787 03-Jun-2002 jeff

Add a comment describing a resource leak that occurs during a failure case
in obj_alloc.


# 97007 20-May-2002 jhb

In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure
that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls
uma we can probably remove the check in malloc() for this now. Also,
perform an extra witness check in that case to make sure we don't hold
any locks when performing a M_WAITOK allocation.
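
A sketch of the added checks, written with the WITNESS_WARN spelling that
replaced WITNESS_SLEEP later (r111883); the assertion text is
illustrative:

    if (!(flags & M_NOWAIT)) {
            KASSERT(curthread->td_intr_nesting_level == 0,
                ("uma_zalloc: M_WAITOK allocation in interrupt context"));
            WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
                "uma_zalloc_arg: zone \"%s\"", zone->uz_name);
    }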


# 96496 13-May-2002 jeff

Don't call the uz free function while the zone lock is held. This can lead
to lock order reversals. uma_reclaim now builds a list of freeable slabs and
then unlocks the zones to do all of the frees.
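
The two-phase drain, sketched; the list and field names are assumed:

    LIST_HEAD(, uma_slab) freeslabs = LIST_HEAD_INITIALIZER(freeslabs);
    uma_slab_t slab;

    ZONE_LOCK(zone);
    /* Phase 1: unhook free slabs while the zone lock is held. */
    while ((slab = LIST_FIRST(&zone->uz_free_slab)) != NULL) {
            LIST_REMOVE(slab, us_link);
            LIST_INSERT_HEAD(&freeslabs, slab, us_link);
    }
    ZONE_UNLOCK(zone);
    /* Phase 2: call the backend free function with no locks held. */
    while ((slab = LIST_FIRST(&freeslabs)) != NULL) {
            LIST_REMOVE(slab, us_link);
            zone->uz_freef(slab->us_data, UMA_SLAB_SIZE, slab->us_flags);
    }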


# 96493 13-May-2002 jeff

Remove the hash_free() lock order reversal. This could have happened for
several reasons before. Fixing it involved restructuring the generic hash
code to require calling code to handle locking, unlocking, and freeing hashes
on error conditions.


# 96044 04-May-2002 jeff

Use pages instead of uz_maxpages, which has not been initialized yet, when
creating the vm_object. This was broken after the code was rearranged to
grab Giant itself.

Spotted by: alc


# 95930 02-May-2002 jeff

Move around the dbg code a bit so it's always under a lock. This stops a
weird potential race if we were preempted right as we were doing the dbg
checks.


# 95925 02-May-2002 arr

- Changed the size element of uma_zctor_args to be size_t instead of int.
- Changed uma_zcreate to accept the size argument as a size_t instead of
int.

Approved by: jeff


# 95923 02-May-2002 jeff

malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.


# 95899 02-May-2002 jeff

Remove the temporary alignment check in free().

Implement the following checks on freed memory in the bucket path:
- Slab membership
- Alignment
- Duplicate free

This previously was only done if we skipped the buckets. This code will slow
down INVARIANTS a bit, but it is SMP safe. The checks were moved out of the
normal path and into hooks supplied in uma_dbg.


# 95766 30-Apr-2002 jeff

Move the implementation of M_ZERO into UMA so that it can be passed to
uma_zalloc and friends. Remove this functionality from the malloc wrapper.

Document this change in uma.h and adjust variable names in uma_core.
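
In effect (a sketch; the zeroing happens once an item is in hand):

    /* Zero on the caller's behalf instead of in the malloc wrapper. */
    if (flags & M_ZERO)
            bzero(item, zone->uz_size);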


# 95758 29-Apr-2002 jeff

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# 95432 25-Apr-2002 arr

- Fix a round down bogon in uma_zone_set_max().

Submitted by: jeff@


# 94653 14-Apr-2002 jeff

Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone. Fix this by doing the allocation
separately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do. We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.
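
The shape of the fix, sketched with assumed helper and field names:

    struct uma_hash newhash;
    int ret;

    ZONE_UNLOCK(zone);
    ret = hash_alloc(&newhash);     /* may sleep; no zone lock held */
    ZONE_LOCK(zone);
    if (ret && newhash.uh_hashsize > zone->uz_hash.uh_hashsize)
            hash_expand(&zone->uz_hash, &newhash);  /* we won the race */
    else
            hash_free(&newhash);                    /* lost it; discard */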


# 94651 14-Apr-2002 jeff

Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.


# 94631 13-Apr-2002 jeff

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT, though, so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
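
The blocking path, roughly; the field and wmesg names are assumed:

    while (zone->uz_pages >= zone->uz_maxpages) {
            if (flags & M_NOWAIT) {
                    ZONE_UNLOCK(zone);
                    return (NULL);          /* fail, as before */
            }
            /* M_WAITOK: sleep until a free brings us under the limit. */
            zone->uz_flags |= UMA_ZFLAG_FULL;
            msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
    }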


# 94329 09-Apr-2002 jeff

Remember to unlock the zone if the fill count is too high.

Pointed out by: pete, jake, jhb


# 94165 08-Apr-2002 jeff

Add a mechanism to disable buckets when the v_free_count drops below
v_free_min. This should help performance in memory starved situations.


# 94163 08-Apr-2002 jeff

Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with Giant.

Pointy hat to: jeff
Noticed by: jake


# 94161 08-Apr-2002 jeff

Implement uma_zdestroy(). Its prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a separate hash_free() for use
in uma_zdestroy.


# 94159 08-Apr-2002 jeff

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# 94155 07-Apr-2002 jeff

This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extreme low-memory situations occurred. This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by: kris


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
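
For a UMA zone the call now looks roughly like this (field names assumed):

    /* Per-zone name plus a shared "UMA zone" type for witness. */
    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);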


# 93697 02-Apr-2002 alfred

fix comment typo, s/neccisary/necessary/g


# 93089 24-Mar-2002 jeff

Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler. This hadn't caused
problems yet because Giant is held all the time.

Reported by: kkenn


# 92758 20-Mar-2002 jeff

Add uma_zone_set_max() to add enforced limits to zones not backed by a VM object.
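
Hypothetical usage (the zone name and struct foo are illustrative):

    zone = uma_zcreate("foo", sizeof(struct foo), NULL, NULL, NULL, NULL,
        UMA_ALIGN_PTR, 0);
    uma_zone_set_max(zone, 4096);   /* enforce a 4096-item ceiling */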


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@