History log of /freebsd-current/sys/vm/uma_int.h
Revision Date Author Comments
# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2760658b 02-May-2021 Alexander Motin <mav@FreeBSD.org>

Improve UMA cache reclamation.

When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
on a vm_lowmem event ZFS can free a few gigabytes of memory to UMA in one
call, but that does not mean it will request the same amount back just as
fast; in fact it won't.

Update the working set size on every reclamation call, shrinking caches faster
under pressure. The lack of this caused repeated vm_lowmem events that squeezed
more and more memory out of real consumers only to leave it stuck in UMA
caches. I saw ZFS drop its ARC size in half before the previous algorithm,
after a periodic WSS update, decided to reclaim the UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom, track a long-term minimum cache size watermark, freeing some unused
items every UMA_TIMEOUT after the first 15 minutes without cache misses. The
freed memory can get better use by other consumers. For example, ZFS won't
grow its ARC unless it sees free memory, since it does not know the memory is
not really used. And even if the memory is not really needed, periodic freeing
during inactivity periods should reduce its fragmentation.

Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
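
For illustration, a minimal sketch of the per-domain voluntary trimming idea
described above. The field and helper names (zdom_last_miss, zdom_nitems,
zdom_limin, zone_reclaim_items) are assumptions for the sketch, not the
actual uma_core.c code:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <vm/uma.h>
    #include <vm/uma_int.h>

    /*
     * Illustrative only: trim a zone domain's bucket cache toward its
     * long-term minimum watermark once enough time has passed without
     * a cache miss.  All names below are hypothetical.
     */
    static void
    zone_domain_trim_sketch(uma_zone_t zone, uma_zone_domain_t zdom)
    {
            long overage;

            /* Start voluntary trimming only after ~15 minutes without misses. */
            if (ticks - zdom->zdom_last_miss < 15 * 60 * hz)
                    return;

            /* Free a fraction of the items above the long-term minimum. */
            overage = zdom->zdom_nitems - zdom->zdom_limin;
            if (overage > 0)
                    zone_reclaim_items(zone, zdom, overage / 8);
    }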


# aabe13f1 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Introduce per-domain reclamation functions

Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.

Currently the new KPIs are unused, so there should be no functional
change.

Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685
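
For illustration, a hedged usage sketch of the new KPIs; the prototypes
(zone, request, domain) and (request, domain) are taken from the description
above and should be verified against uma.h:

    #include <vm/uma.h>

    /* Trim the bucket caches that belong to a single NUMA domain. */
    static void
    trim_domain_caches(uma_zone_t zone, int domain)
    {
            /* One zone's per-domain cache... */
            uma_zone_reclaim_domain(zone, UMA_RECLAIM_TRIM, domain);

            /* ...or every zone's cache for that domain. */
            uma_reclaim_domain(UMA_RECLAIM_TRIM, domain);
    }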


# d2f1c44b 27-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Remove the MINBUCKET flag from the flag name list

This should have been done in r368399 / commit
f8b6c51538fab88a7a62a399fb0948806b06133c.

Reported by: rlibby
Sponsored by: The FreeBSD Foundation


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# a2e19465 28-Aug-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

memstat_kvm_uma: fix reading of uma_zone_domain structures

Coverity flagged the scaling by sizeof(uzd). That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.

Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.

Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26213


# c8b0a88b 20-Jun-2020 Jeff Roberson <jeff@FreeBSD.org>

Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.


# 54007ce8 07-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Clean up uma_int.h a bit.

This makes it easier to write libkvm programs that access UMA data
structures.

- Remove a couple of unused slab functions and make others local to
uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of
slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL. There's no real
reason they can't be visible to userspace like the rest of UMA's
structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert.

No functional change intended.

Discussed with: jeff
Reviewed by: rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23980


# c6fd3e23 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for the bucket cache.

This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.

While here refactor the zone/domain separation and bucket limits to
simplify callers.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673


# 4ab3aee8 11-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Reduce lock hold time in keg_drain().

Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.

Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532


# bae55c4a 06-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: remove UMA_ZFLAG_CACHEONLY flag

UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516


# ec0d8280 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: add UMA_ZONE_CONTIG, and a default contig_alloc

For now, copy the mbuf allocator.

Reviewed by: jeff, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23237


# dc3915c8 03-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior
and this is more space efficient.

Stop queueing recently used buckets to the head of the list. If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493


# d4665eaa 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Implement a safe memory reclamation feature that is tightly coupled with UMA.

This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.

This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.

Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586


# 9b8db4d0 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: split slabzone into two sizes

By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149


# 4a8b575c 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: unify layout paths and improve efficiency

Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048


# 54c5ae80 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: reorganize flags

- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.

The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016


# 79c9f942 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix UMA boot pages calculations on NUMA machines that also don't have
UMA_MD_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure, but the specified alignment did not match
reality and involved redundant defines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046


# 31c251a0 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.

Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu


# dfe13344 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831


# 91d947bf 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists. This is a several-orders-of-magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket per-domain worth of memory.

Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830


# 8b987a77 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829


# 727c6918 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828


# 4bd61e19 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug for bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827


# cc7ce83a 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed only touching the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826


# 376b1ba3 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Optimize fast path allocations by storing bucket headers in the per-cpu
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825


# 815db204 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# d82c8ffb 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

Revert r355706 & r355710

The quick fix didn't work. I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"


# 7508f15f 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 6d204a6a 10-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: pretty print zone flags sysctl

Requested by: jeff
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22748


# 1e0701e1 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609


# 9b78b1f4 02-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by: markj, rlibby (prior version)
Differential Revision: https://reviews.freebsd.org/D22584


# 6d6a03d7 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22563


# 584061b4 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564


# 20a4e154 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a sysctl tree for uma zones to assist in debugging and provide
more statistics than are exported via the ABI-stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22554


# ca293436 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: trash memory when ctor/dtor supplied too

On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function. Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20722


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
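
For illustration, a brief sketch of the reclamation verbs; the request
constants are assumed to be spelled UMA_RECLAIM_TRIM, UMA_RECLAIM_DRAIN and
UMA_RECLAIM_DRAIN_CPU and should be confirmed against uma.h:

    #include <vm/uma.h>

    static void
    reclaim_examples(uma_zone_t zone)
    {
            /* Free only items in excess of the estimated working set. */
            uma_zone_reclaim(zone, UMA_RECLAIM_TRIM);

            /* Free the entire bucket cache of every zone. */
            uma_reclaim(UMA_RECLAIM_DRAIN);

            /* Additionally flush per-CPU caches; the most expensive option. */
            uma_reclaim(UMA_RECLAIM_DRAIN_CPU);
    }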


# c1685086 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.

Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929


# 6929b7d1 11-Feb-2019 Pedro F. Giffuni <pfg@FreeBSD.org>

UMA: unsign some variables related to allocation in hash_alloc().

As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.

While here, also unsign uh_hashmask; it makes little sense to keep it
signed.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148


# ad66f958 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that there is only one way to allocate a slab, remove uz_slab method.

Discussed with: jeff


# b68d692a 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Whitespace.


# 39669415 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix compilation failures on arches whose vm_machdep.c is not aware of
counter_u64_t, by including counter.h in uma_int.h. I'm not
happy about this inclusion, but it fixes compilation ASAP.


# 2efcc8cb 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces the amount of data protected by the zone lock.

During startup, point these fields at EARLY_COUNTER. After startup,
allocate them for all early zones.

Tested by: pho


# bb15d1c7 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

o Move the zone limit from the keg level up to the zone level. This means
that two zones sharing a keg may now have different limits. Now this is
going to work:

    zone = uma_zcreate();
    uma_zone_set_max(zone, limit);
    zone2 = uma_zsecond_create(zone);
    uma_zone_set_max(zone2, limit2);

Kegs no longer have a uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to the minimum possible CPU bucket cache size.
For small limits the bucket cache can also be reconfigured to be smaller.
The uz_items counter is updated whenever items transition from the keg to a
bucket cache or directly to a consumer. If a zone has uz_maxitems set and
it is reached, then we are going to sleep.

o Since the new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and it never
had a practical use. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or wherever, and the choice should be
controlled by the caller.

o The sleeping code is improved to account for the number of sleepers and
wake them one by one, to avoid the thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.

Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
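
For illustration, a sketch combining the zone-level limits with the new
uma_zone_set_maxcache() KPI; the uma_zcreate()/uma_zsecond_create() argument
lists below are from memory and should be checked against uma.h:

    #include <vm/uma.h>

    static void
    zone_limit_example(void)
    {
            uma_zone_t zone, zone2;

            zone = uma_zcreate("example", 128, NULL, NULL, NULL, NULL,
                UMA_ALIGN_PTR, 0);
            uma_zone_set_max(zone, 10000);          /* limit is in items */

            /* A second zone on the same keg may have its own limit. */
            zone2 = uma_zsecond_create("example2", NULL, NULL, NULL, NULL, zone);
            uma_zone_set_max(zone2, 500);

            /* Replacement for UMA_ZONE_NOBUCKETCACHE: no bucket cache at all. */
            uma_zone_set_maxcache(zone, 0);
    }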


# 3d5e3df7 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at pointer
size alignment. Calculation of the real struct uma_slab size is done in
keg_ctor() and yet again in keg_large_init(), to check if we need an extra
page. This calculation can actually be performed at compile time.

- Add a SIZEOF_UMA_SLAB macro to calculate the size of struct uma_slab placed
at the end of a page with the alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
a functional change.
- Use SIZEOF_UMA_SLAB in the UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since the compiler aligns struct uma_slab anyway.
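
For illustration, a hedged sketch of the compile-time idea; the real
SIZEOF_UMA_SLAB definition may differ, and this only shows the rounding of the
embedded header to pointer alignment:

    #include <sys/param.h>
    #include <vm/uma_int.h>

    /* Hypothetical spelling: embedded slab header size, pointer-aligned. */
    #define SIZEOF_UMA_SLAB_SKETCH  roundup(sizeof(struct uma_slab), sizeof(void *))
    /* Space left in the page for items when the header is embedded. */
    #define UMA_SLAB_SPACE_SKETCH   (PAGE_SIZE - SIZEOF_UMA_SLAB_SKETCH)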


# 0f9b7bf3 13-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Add accounting to per-domain UMA full bucket caches.

In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666


# 7571e249 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add an #include required after r339686.

X-MFC with: r339686
Sponsored by: The FreeBSD Foundation


# 194a979e 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Use a vm_domainset iterator in keg_fetch_slab().

Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420


# 306abf0f 24-Aug-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Either "free" or "allocated" is misleading here, since an item
in a bucket is free from perspective of UMA consumer, and it is
allocated from perspective of keg.

Discussed with: markj
Approved by: re (kib)


# a307fb5b 23-Aug-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix comment. The actual meaning of ub_cnt is the opposite.


# 63b5557b 23-Jun-2018 Jeff Roberson <jeff@FreeBSD.org>

Sort uma_zone fields according to 64 byte cache line with adjacent line
prefetch on 64bit architectures. Prior to this, two lines were needed
for the fast path and each line may fetch an unused adjacent neighbor.
- Move fields used by the fast path into a single line.
- Move constants into the adjacent line which is mostly used for
the spare bucket alloc 'medium path'.
- Unpad the mtx which is only used by the fast path and place it in
a line with rarely used data. This aligns the cachelines better and
eliminates 128 bytes of wasted space.

This gives a 45% improvement on a will-it-scale test on a 24 core machine.

Reviewed by: mmacy


# 12f69195 04-Jun-2018 Justin Hibbits <jhibbits@FreeBSD.org>

Align UMA data to 128 byte cacheline size

Suggested by: mjg


# 782e38aa 11-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: increase alignment to 128 bytes on amd64

Current UMA internals are not suited for efficient operation in
multi-socket environments. In particular there is very common use of
MAXCPU arrays and other fields which are not always properly aligned and
are not local for target threads (apart from the first node of course).
Turns out the existing UMA_ALIGN macro can be used to mostly work around
the problem until the code gets fixed. The current setting of 64 bytes
runs into trouble when the adjacent cache line prefetcher gets to work.

An example 128-way benchmark doing a lot of malloc/frees has the following
instruction samples:

before:
kernel`lf_advlockasync+0x43b 32940
kernel`malloc+0xe5 42380
kernel`bzero+0x19 47798
kernel`spinlock_exit+0x26 60423
kernel`0xffffffff80 78238
0x0 136947
kernel`uma_zfree_arg+0x46 159594
kernel`uma_zalloc_arg+0x672 180556
kernel`uma_zfree_arg+0x2a 459923
kernel`uma_zalloc_arg+0x5ec 489910

after:
kernel`bzero+0xd 46115
kernel`lf_advlockasync+0x25f 46134
kernel`lf_advlockasync+0x38a 49078
kernel`fget_unlocked+0xd1 49942
kernel`lf_advlockasync+0x43b 55392
kernel`copyin+0x4a 56963
kernel`bzero+0x19 81983
kernel`spinlock_exit+0x26 91889
kernel`0xffffffff80 136357
0x0 239424

See the review for more details.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D15346


# 5073a083 07-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix three miscalculations in amount of boot pages:

o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the number of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by: pho [1], jtl [2]


# f4bef67c 05-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. What is more important, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
startup_alloc(), however that may change, so vm_page_startup() provides
its need for early zones as an argument.
o Introduce the uma_startup_count() function, to avoid code duplication. The
function calculates the sizes of the zones zone and the kegs zone, and
calculates how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size use (mp_maxid + 1) instead
of mp_ncpus. Use the resulting number not only in the size argument to
zone_ctor() but also as args.size.

Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
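
For illustration, a hedged sketch of the _domain() variants; the argument
order is assumed and should be verified against uma.h and the malloc(9) of
this era:

    #include <sys/malloc.h>
    #include <vm/uma.h>

    static void
    alloc_on_domain(uma_zone_t zone, struct malloc_type *type, int domain)
    {
            void *item, *buf;

            /* Allocate one zone item and one malloc buffer from 'domain'. */
            item = uma_zalloc_domain(zone, NULL, domain, M_WAITOK);
            buf = malloc_domain(256, type, domain, M_WAITOK | M_ZERO);

            /* ... use them ... */

            free(buf, type);
            uma_zfree(zone, item);
    }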


# ad5b0f5b 01-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix the ZFS ARC after r326347 broke various memory limit queries. Use UMA
features rather than the kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494


# 2e47807c 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.

The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA. When
the soft limit is exceeded we periodically wake up the UMA reclaim thread to
attempt to shrink KVA. On 32-bit architectures this should behave much more
gracefully as we exhaust KVA. On 64-bit the limits are likely never hit.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the tool I
was using misidentified many licenses, so this was mostly a manual, error-prone
task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# e04223bf 15-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Include _bitset.h to get BITSET_DEFINE, used to define struct slabbits.

MFC after: 1 week


# 2d54d4bb 13-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Widen uk_pgoff, the slab header offset field.

16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.

We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.

PR: 218911
MFC after: 2 weeks


# a55ebb7c 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: eliminate uk_slabsize field

The field was not used beyond the initial keg setup stage anyway.

MFC after: 1 month (if ever)


# 34caa842 07-Jul-2016 Colin Percival <cperciva@FreeBSD.org>

Autotune the number of pages set aside for UMA startup based on the number
of CPUs present. On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64). Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger. It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on: EC2 x1.32xlarge
Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring
vm.boot_pages to be manually adjusted.
Reviewed by: jeff, alc, adrian
Approved by: re (kib)


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# cfcae3f8 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Remove UMA_ZONE_REFCNT feature, now unused.

Blessed by: jeff


# b28cc462 09-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by: jhb


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# 54503a13 19-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add a safety net to reclaim mbufs when one of the mbuf zones becomes
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
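
For illustration, a hedged sketch of the new hook from item (a); the setter
name uma_zone_set_maxaction() is inferred from the uz_maxaction field
mentioned in a later entry, and the callback prototype is an assumption:

    #include <sys/param.h>
    #include <sys/taskqueue.h>
    #include <vm/uma.h>

    static struct task reclaim_task;    /* set up elsewhere with TASK_INIT() */

    /* Called when an allocation fails because the zone limit was reached. */
    static void
    zone_full_cb(uma_zone_t zone __unused, int flags __unused)
    {
            /* Minimal work only: defer the real reclamation. */
            taskqueue_enqueue(taskqueue_thread, &reclaim_task);
    }

    static void
    register_maxaction(uma_zone_t zone)
    {
            uma_zone_set_maxaction(zone, zone_full_cb);     /* assumed prototype */
    }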


# 43ffa928 24-Apr-2015 Scott Long <scottl@FreeBSD.org>

Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included. The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from: Netflix, Inc.
MFC after: immediately


# 51cfb0be 12-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Rework r281162. Indeed, the flexible array member is preferable here.

Suggested by: Justin T. Gibbs

MFC after: 3 days


# f2c2231e 31-Mar-2015 Ryan Stone <rstone@FreeBSD.org>

Fix integer truncation bug in malloc(9)

A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks


# ace66b56 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement soft pressure on UMA cache bucket sizes.

Every time the system detects a low memory condition, decrease the bucket size
for each zone by one item. As a result, higher memory pressure will push toward
smaller bucket sizes, and so smaller per-CPU caches and more efficient
memory use.

Before this change there was no force opposing bucket growth resulting
from the practically inevitable zone lock conflicts, and after some run time
the per-CPU caches could consume enough RAM to kill the system.


# 9eab5484 17-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of struct vm_page abuse the pageq member to keep
additional information when the page is guaranteed not to belong to a
paging queue. Usually, this results in a lot of type casts, which make
reasoning about the correctness of the code harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# dab12c75 24-Jul-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Since r251709 a slab no longer uses 8-bit indices to manage items,
thus remove a stale comment.

Reviewed by: jeff


# 6fd34d6f 25-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. buckets will never be filled
from reserve allocations.

Sponsored by: EMC / Isilon Storage Division
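
For illustration, a sketch of pairing uma_zone_reserve() with M_USE_RESERVE;
struct io_request and the zone name are hypothetical:

    #include <sys/malloc.h>
    #include <vm/uma.h>

    struct io_request { int ior_op; char ior_data[120]; };     /* hypothetical */

    static uma_zone_t io_zone;

    static void
    io_zone_init(void)
    {
            io_zone = uma_zcreate("io requests", sizeof(struct io_request),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            /* Hold 32 items aside; only M_USE_RESERVE callers may use them. */
            uma_zone_reserve(io_zone, 32);
    }

    static struct io_request *
    io_request_alloc_critical(void)
    {
            /* May be satisfied from the reserve when the zone is depleted. */
            return (uma_zalloc(io_zone, M_NOWAIT | M_USE_RESERVE));
    }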


# af526374 20-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address-of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# fc03d22b 17-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 0095a784 16-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.

Sponsored by: EMC / Isilon Storage Division
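
For illustration, a hedged sketch of a keg-less cache zone; the
uma_zcache_create() parameter order, the import/release callback signatures,
and the backing_alloc()/backing_free() names are assumptions:

    #include <vm/uma.h>

    /* Hypothetical backing store, defined elsewhere. */
    void *backing_alloc(void *arg, int flags);
    void backing_free(void *arg, void *item);

    /* Import fills 'store' with up to 'count' items from the backing store. */
    static int
    tag_import(void *arg, void **store, int count, int flags)
    {
            int i;

            for (i = 0; i < count; i++)
                    if ((store[i] = backing_alloc(arg, flags)) == NULL)
                            break;
            return (i);
    }

    /* Release hands cached items back to the backing store. */
    static void
    tag_release(void *arg, void **store, int count)
    {
            int i;

            for (i = 0; i < count; i++)
                    backing_free(arg, store[i]);
    }

    static uma_zone_t
    tag_zone_create(void *backend)
    {
            /* 64 is an arbitrary item size for the sketch. */
            return (uma_zcache_create("tags", 64, NULL, NULL,
                NULL, NULL, tag_import, tag_release, backend, 0));
    }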


# ef72505e 13-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 85dcf349 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Convert UMA code to C99 uintXX_t types.


# 04fc5741 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by: alc


# 8cf455b8 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Since we now support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but the growth isn't
significant. Taking kmem zones as an example, only the 32-byte
zone is affected; ipers is reduced from 113 to 112.

In collaboration with: kib


# ad97af7e 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from the VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such a
zone has a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance will allow us to make some optimizations later.

To address a CPU's private copy of an item, simple arithmetic is used:

    item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
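
For illustration, a tiny sketch of a per-CPU counter built on such a zone;
the zone name and the exact uma_zcreate() argument list are assumptions:

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/pcpu.h>
    #include <vm/uma.h>

    static uma_zone_t pcpu64_zone;
    static uint64_t *stat_base;

    static void
    stat_init(void)
    {
            pcpu64_zone = uma_zcreate("pcpu-64", sizeof(uint64_t),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
            stat_base = uma_zalloc(pcpu64_zone, M_WAITOK);
    }

    static void
    stat_bump(void)
    {
            /* zpcpu_get() applies the sizeof(struct pcpu) * curcpu offset. */
            (*zpcpu_get(stat_base))++;
    }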

To introduce non-page-size slabs, a new field, uk_slabsize, has been added to
uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, the uma_keg fields
were rearranged a bit and the least frequently used uk_name and uk_link were
moved down to the fourth cache line. All other frequently dereferenced
fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater for the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater for the backend
allocator.
- uma_zone_reserve_kva() can cater for M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses the direct mapping
via uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide
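
For illustration, a minimal sketch of the replacement call; the return value
convention of uma_zone_reserve_kva() is not checked here and the zone shown
is hypothetical:

    #include <vm/uma.h>

    static uma_zone_t kvaitem_zone;

    static void
    kvaitem_zone_init(int maxitems)
    {
            kvaitem_zone = uma_zcreate("kva items", 256, NULL, NULL, NULL, NULL,
                UMA_ALIGN_PTR, 0);
            /* Reserve KVA up front instead of passing a VM object. */
            (void)uma_zone_reserve_kva(kvaitem_zone, maxitems);
    }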


# 936c747b 21-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Comment fix: there is no ub_ptr; instead, explain the meaning of the uz_count
field verbally.


# 2f891cd5 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
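
For illustration, a one-call sketch; the prototype is assumed to take the
zone and a constant warning string, as described in uma_zone_set_warning(9):

    #include <vm/uma.h>

    static void
    clust_zone_set_warning(uma_zone_t zone)
    {
            /* Printed at most once every five minutes when the zone is full. */
            uma_zone_set_warning(zone,
                "Increase kern.ipc.nmbclusters: zone limit reached");
    }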


# bb196eb4 26-Oct-2012 Matthew D Fleming <mdf@FreeBSD.org>

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 342f1793 21-May-2011 Alan Cox <alc@FreeBSD.org>

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# 59d7277f 20-May-2011 Alan Cox <alc@FreeBSD.org>

Fix spelling errors.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# bf965959 15-Jun-2010 Sean Bruno <sbruno@FreeBSD.org>

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 1a23373c 22-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

- enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone


# 6b4391d7 18-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

turn 205266 in to a no-op until the problem can be properly diagnosed


# 5e4bb93c 17-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

Cache line align various structures and move volatile counters to
not share a cache line with (mostly) immutable state

Reviewed by: jeff@
MFC after: 7 days


# f12d6d2a 07-Jan-2010 Antoine Brodin <antoine@FreeBSD.org>

MFC r200129 to stable/8:
Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.


# 4e2d83fc 05-Dec-2009 Antoine Brodin <antoine@FreeBSD.org>

Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.

MFC after: 1 month


# e20a199f 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 6ab3b958 09-May-2007 Robert Watson <rwatson@FreeBSD.org>

Update stale comment on protecting UMA per-CPU caches: we now use
critical sections rather than mutexes.


# af17e9a9 04-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be
used from memstat_uma.c for the purposes of kvm access without lots
of additional unsafe includes.

MFC after: 3 days


# 08ecce74 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 2018f30c 15-Jul-2005 Mike Silbersack <silby@FreeBSD.org>

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 2019094a 15-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 7a52a97e 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user-space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week
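
For illustration, a hedged user-space sketch of consuming the stream; the
header field names below are assumptions based on the description (version,
MAXCPU, record count), and the real layouts live in <vm/uma.h>:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed minimal view of the stream header described above. */
    struct zstat_header_sketch {
            uint32_t        version;
            uint32_t        maxcpus;
            uint32_t        count;
    };

    int
    main(void)
    {
            struct zstat_header_sketch *hdr;
            size_t len = 0;
            char *buf;

            /* Size the buffer, then fetch the whole binary stream. */
            if (sysctlbyname("vm.zone_stats", NULL, &len, NULL, 0) != 0)
                    err(1, "sysctlbyname size");
            if ((buf = malloc(len)) == NULL)
                    err(1, "malloc");
            if (sysctlbyname("vm.zone_stats", buf, &len, NULL, 0) != 0)
                    err(1, "sysctlbyname read");

            hdr = (struct zstat_header_sketch *)buf;
            printf("stream v%u, %u CPUs, %u zones\n",
                hdr->version, hdr->maxcpus, hdr->count);
            /* Each record: one type header followed by maxcpus per-CPU structs. */
            free(buf);
            return (0);
    }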


# 773df9ab 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# eafc7b54 16-Jun-2005 Alan Cox <alc@FreeBSD.org>

Increase UMA_BOOT_PAGES to prevent a crash during initialization. See
http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed
description of the crash.

Reported by: Eric Anderson
Approved by: re (scottl)
MFC after: 3 days


# 5d1ae027 29-Apr-2005 Robert Watson <rwatson@FreeBSD.org>

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others


# 8076cb52 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 7b871205 25-Dec-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# 6fc96493 26-Nov-2004 Olivier Houchard <cognet@FreeBSD.org>

Remove useless casts.


# 244f4554 29-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportioned to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers, we first find the largest objectsize for which
we are guaranteed not to go offpage, and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), we are guaranteed that the computed
ipers is always >= what we will ever need in offpage slab
headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
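
The sketch referenced above: a simplified version of the wasted-space
decision (variable names, constants, and rounding are illustrative, not
the actual zone_small_init() code):

    objsize = item_size + UMA_FRITM_SZ;     /* item plus its freelist index */
    ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) / objsize;
    wastedspace = UMA_SLAB_SIZE - sizeof(struct uma_slab) - ipers * objsize;
    if (wastedspace >= UMA_MAX_WASTE) {
            /* An in-page header would waste too much space; go off-page. */
            flags |= UMA_ZONE_OFFPAGE;
            ipers = UMA_SLAB_SIZE / objsize;        /* whole page holds items */
    }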


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them at the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page and looks up the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change certain code paths that used to do m_get() + m_clget()
to instead use m_getcl() and take advantage of the newly
defined secondary Packet zone (see the sketch at the end of
this entry).
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
The latest report is that ips performs equally poorly with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper detailing the change and considering various
performance issues is available; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)
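
The m_get() + m_clget() -> m_getcl() conversion mentioned above looks
roughly like this (a generic sketch, not a specific converted call site;
error handling trimmed):

    struct mbuf *m;

    /* Before: allocate the mbuf and its cluster separately. */
    m = m_get(M_DONTWAIT, MT_DATA);
    if (m != NULL) {
            m_clget(m, M_DONTWAIT);
            if ((m->m_flags & M_EXT) == 0) {
                    m_free(m);
                    m = NULL;
            }
    }

    /* After: one call, served from the secondary Packet zone. */
    m = m_getcl(M_DONTWAIT, MT_DATA, 0);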


# c19aa340 17-Jan-2004 Alan Cox <alc@FreeBSD.org>

Increase UMA_BOOT_PAGES because of changes to pv entry initialization in
revision 1.457 of i386/i386/pmap.c.


# 925692ca 21-Dec-2003 Alan Cox <alc@FreeBSD.org>

- Significantly reduce the number of preallocated pv entries in
pmap_init(). Such a large preallocation is unnecessary and wastes
nearly eight megabytes of kernel virtual address space per gigabyte
of managed physical memory.
- Increase UMA_BOOT_PAGES by two. This enables the removal of
pmap_pv_allocf(). (Note: this function was only used during
initialization, specifically, after pmap_init() but before
pmap_init2(). During pmap_init2(), a new allocator is installed.)


# 9643769a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 3e0cab95 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# b60f5b79 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot themselves in the foot.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destroyed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# cae33c14 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.
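
For reference, the fields renamed and added above describe a bucket laid
out roughly as follows (a sketch, not necessarily the exact struct
uma_bucket definition):

    struct uma_bucket {
            LIST_ENTRY(uma_bucket)  ub_link;        /* linkage for bucket lists */
            int16_t                 ub_cnt;         /* count of free items held */
            int16_t                 ub_entries;     /* entries the bucket can hold */
            void                    *ub_bucket[];   /* the cached items */
    };

    /* ub_cnt counts free items, so popping one needs no off-by-one math: */
    item = bucket->ub_bucket[--bucket->ub_cnt];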


# 20e8e865 11-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE
(see the sketch at the end of this entry). Add a KASSERT in
zone_small_init to make sure that the computed ipers (items per slab)
for the zone is not zero; even with the new check in place, this is
just an extra safeguard (this part submitted by: silby).

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
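
The sketch referenced in the first item above, showing the new comparison
(illustrative, not the exact uma_core.c test):

    /*
     * An item, plus its one byte of freelist linkage, must fit next to
     * an in-page slab header to qualify for zone_small_init().
     */
    if (zone->uz_size + 1 > UMA_SLAB_SIZE - sizeof(struct uma_slab))
            zone_large_init(zone);
    else
            zone_small_init(zone);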


# f828e5be 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# d88797c2 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# c5d771b8 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Prepend _ to internal union members to avoid ambiguity.

Found by: FlexeLint


# 48eea375 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add support for machine-dependent page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake


# f461cf22 19-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 99571dc3 18-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
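
Illustrative helpers for the scheme described above (hypothetical names
and bodies; the real inlines in uma_int.h may differ):

    static __inline void
    vsetslab_sketch(vm_offset_t va, uma_slab_t slab)
    {
            vm_page_t p;

            /* Overload the page's object pointer to remember its slab. */
            p = PHYS_TO_VM_PAGE(pmap_kextract(va));
            p->object = (vm_object_t)slab;
    }

    static __inline uma_slab_t
    vtoslab_sketch(vm_offset_t va)
    {
            vm_page_t p;

            p = PHYS_TO_VM_PAGE(pmap_kextract(va));
            return ((uma_slab_t)p->object);
    }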


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
To come: ia64 and power-pc patches, patches for gdb, test program (in tools).

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
Expect slight instability in signals.


# 18aa2de5 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM, which sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
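
An illustrative zone creation using the new flag (the zone name and the
other arguments here are examples only, not quotes of the actual diff):

    mapentzone = uma_zcreate("MAP ENTRY", sizeof(struct vm_map_entry),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);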


# 28bc4419 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# af7f9b97 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
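
A rough sketch of the new blocking behavior (field and variable names are
illustrative, not the actual zone code):

    ZONE_LOCK(zone);
    while (zone->uz_pages >= zone->uz_maxpages) {
            if ((flags & M_WAITOK) == 0) {
                    /* M_NOWAIT callers still just fail at the limit. */
                    ZONE_UNLOCK(zone);
                    return (NULL);
            }
            /* M_WAITOK callers sleep until an item is freed back. */
            msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
    }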


# 1d4cb54b 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Quiet witness warnings about acquiring several zone locks. In the cases
where this happens, it is OK.


# a553d4b8 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# c235bfa5 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Spelling correction; s/seperate/separate/g

Submitted by: eric


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
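
Illustrative calls showing the lock type name argument (these call sites
are examples, not quotes of the actual diff):

    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);
    mtx_init(&sc->sc_mtx, device_get_nameunit(sc->sc_dev), MTX_NETWORK_LOCK,
        MTX_DEF);
    mtx_init(&foo_mtx, "foo", NULL, MTX_DEF);       /* the common case */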


# f22a4b62 27-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab-like allocator.

Reviewed by: arch@