#
0020e1b6 |
|
10-Apr-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Revert "sendfile: mark it explicitly as a TCP only feature" This reverts commit 3b7aa842e27dcf07181f161b1abde0067ed51e97.
|
#
3b7aa842 |
|
08-Apr-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sendfile: mark it explicitly as a TCP only feature Back in 2015 when it turned non-blocking, it was working with PF_UNIX and it may still work. However, the usefullness of such application of sendfile(2) is questionable. Disable the feature while unix/stream is under refactoring. Relnotes: yes
|
#
61cc4830 |
|
18-Jan-2024 |
Alfredo Mazzinghi <am2419@cl.cam.ac.uk> |
Abstract UIO allocation and deallocation. Introduce the allocuio() and freeuio() functions to allocate and deallocate struct uio. This hides the actual allocator interface, so it is easier to modify the sub-allocation layout of struct uio and the corresponding iovec array. Obtained from: CheriBSD Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: CHaOS, EPSRC grant EP/V000292/1 Differential Revision: https://reviews.freebsd.org/D43711
|
#
d0adc2f2 |
|
25-Dec-2023 |
Mark Johnston <markj@FreeBSD.org> |
sendfile: Explicitly ignore errors from copyout() There is a documented bug in sendfile.2 which notes that sendfile(2) does not raise an error if it fails to copy out the number of bytes written. Explicitly ignore the error from copyout() calls in preparation for annotating copyout() with __result_use_check. Reviewed by: glebius, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D43129
|
#
685dc743 |
|
16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
|
#
73ee5756 |
|
31-Mar-2023 |
Randall Stewart <rrs@FreeBSD.org> |
Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching as always been a bit of a issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210
|
#
f45feecf |
|
22-Sep-2022 |
Mateusz Guzik <mjg@FreeBSD.org> |
vfs: add vn_getsize getattr is very expensive and in important cases only gets called to get the size. This can be optimized with a dedicated routine which obtains that statistic. As a step towards that goal make size-only consumers use a dedicated routine. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D37885
|
#
3212ad15 |
|
07-Sep-2022 |
Mateusz Guzik <mjg@FreeBSD.org> |
Add getsock All but one consumers of getsock_cap only pass 4 arguments. Take advantage of it.
|
#
e7d02be1 |
|
17-Aug-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
|
#
43283184 |
|
12-May-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: use socket buffer mutexes in struct socket directly Since c67f3b8b78e the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In 74a68313b50 macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152
|
#
fe27f1db |
|
25-Dec-2021 |
Alexander Motin <mav@FreeBSD.org> |
kern: Remove CTLFLAG_NEEDGIANT from some sysctls. MFC after: 2 weeks
|
#
e3ba94d4 |
|
09-Nov-2021 |
John Baldwin <jhb@FreeBSD.org> |
Don't require the socket lock for sorele(). Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741
|
#
523d58aa |
|
07-Sep-2021 |
Mark Johnston <markj@FreeBSD.org> |
socket: Remove unneeded SOLISTENING checks Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate some racy SOLISTENING() checks. No functional change intended. Reviewed by: tuexen MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31660
|
#
f94acf52 |
|
07-Sep-2021 |
Mark Johnston <markj@FreeBSD.org> |
socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writters from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657
|
#
916c61a5 |
|
21-May-2021 |
Mark Johnston <markj@FreeBSD.org> |
Fix handling of errors from pru_send(PRUS_NOTREADY) PRUS_NOTREADY indicates that the caller has not yet populated the chain with data, and so it is not ready for transmission. This is used by sendfile (for async I/O) and KTLS (for encryption). In particular, if pru_send returns an error, the caller is responsible for freeing the chain since other implicit references to the data buffers exist. For async sendfile, it happens that an error will only be returned if the connection was dropped, in which case tcp_usr_ready() will handle freeing the chain. But since KTLS can be used in conjunction with the regular socket I/O system calls, many more error cases - which do not result in the connection being dropped - are reachable. In these cases, KTLS was effectively assuming success. So: - Change sosend_generic() to free the mbuf chain if pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the chain at that point. - Similarly, in vn_sendfile() change the !async I/O && KTLS case to free the chain. - If async I/O is still outstanding when pru_send fails in vn_sendfile(), set an error in the sfio structure so that the connection is aborted and the mbuf chain is freed. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349
|
#
52a99c72 |
|
02-Apr-2021 |
Mark Johnston <markj@FreeBSD.org> |
sendfile: Fix error initialization in sendfile_getobj() Reviewed by: chs, kib Reported by: jhb Fixes: faa998f6ff695 MFC after: 1 day Differential Revision: https://reviews.freebsd.org/D29540
|
#
faa998f6 |
|
25-Feb-2021 |
Mark Johnston <markj@FreeBSD.org> |
sendfile: Use the pager size to determine the file extent when possible Previously sendfile would issue a VOP_GETATTR and use the returned size, i.e., the file size. When paging in file data, sendfile_swapin() will use the pager to determine whether it needs to zero-fill, most often because of a hole in a sparse file. An attempt to page in beyond the end of a file is treated this way, and occurs when the requested page is past the end of the pager. In other words, both the file size and pager size were used interchangeably. With ZFS, updates to the pager and file sizes are not synchronized by the exclusive vnode lock, at least partially due to its use of MNTK_SHARED_WRITES. In particular, the pager size is updated after the file size, so in the presence of a writer concurrently extending the file, sendfile could incorrectly instantiate "holes" in the page cache pages backing the file, which manifests as data corruption when reading the file back from the page cache. The on-disk copy is unaffected. Fix this by consistently using the pager size when available. Reported by: dumbbell Reviewed by: chs, kib Tested by: dumbbell, pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28811
|
#
214257da |
|
03-Jan-2021 |
Mark Johnston <markj@FreeBSD.org> |
sendfile: Clear page pointers when handling a pager error When INVARIANTS is configred, the sendfile_iodone() callback verifies that pages attached to the sendfile header are wired, but we unwire all such pages after a synchronous pager error, before calling sendfile_iodone(). Reported by: pho Tested by: pho Sponsored by: The FreeBSD Foundation
|
#
26b23f07 |
|
26-Dec-2020 |
Mark Johnston <markj@FreeBSD.org> |
sendfile: Ensure that sfio->npages is initialized We initialize sfio->npages only when some I/O is required to satisfy the request. However, sendfile_iodone() contains an INVARIANTS-only check that references sfio->npages, and this check is executed even if no I/O is performed, so the check may use an uninitialized value. Fix the problem by initializing sfio->npages earlier. Note that sendfile_swapin() always initializes the page array. In some rare cases we need to trim the page array so ensure that sfio->npages gets updated accordingly. Reported by: syzkaller (with KASAN) Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27726
|
#
cd853791 |
|
27-Nov-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embeded structures, get private MAXPHYS-like constant; their convertion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, where either submitted by, or based on changes by mav. Suggested by: mav (*) Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225
|
#
6fed89b1 |
|
01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
kern: clean up empty lines in .c and .h files
|
#
c2ea3d44 |
|
05-Jun-2020 |
Chuck Silvers <chs@FreeBSD.org> |
Fix hang due to missing unbusy in sendfile when an async data I/O fails. r359473 removed the page unbusy logic from sendfile_iodone() because when vm_pager_get_pages_async() would return an error after failing to start the async I/O (eg. because VOP_BMAP failed), sendfile_swapin() would also unbusy the pages, and it was wrong to unbusy twice. However this breaks the case where vm_pager_get_pages_async() succeeds in starting an async I/O and the async I/O is what fails. In this case, sendfile_iodone() must unbusy the pages, and because sendfile_iodone() doesn't know which case it is in, sendfile_iodone() must always unbusy pages and relookup pages which have been substituted with bogus_page, which in turn means that sendfile_swapin() must never do unbusy or relookup for pages which have been given to vm_pager_get_pages_async(), even if there is an error. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25136
|
#
61664ee7 |
|
02-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Step 4.2: start divorce of M_EXT and M_EXTPG They have more differencies than similarities. For now there is lots of code that would check for M_EXT only and work correctly on M_EXTPG buffers, so still carry M_EXT bit together with M_EXTPG. However, prepare some code for explicit check for M_EXTPG. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
|
#
6edfd179 |
|
02-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
|
#
7b6c99d0 |
|
02-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
|
#
bccf6e26 |
|
02-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
|
#
0c103266 |
|
02-May-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Continuation of multi page mbuf redesign from r359919. The following series of patches addresses three things: Now that array of pages is embedded into mbuf, we no longer need separate structure to pass around, so struct mbuf_ext_pgs is an artifact of the first implementation. And struct mbuf_ext_pgs_data is a crutch to accomodate the main idea r359919 with minimal churn. Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP. The namespace for the newfeature is somewhat inconsistent and sometimes has a lengthy prefixes. In these patches we will gradually bring the namespace to "m_epg" prefix for all mbuf fields and most functions. Step 1 of 4: o Anonymize mbuf_ext_pgs_data, embed in m_ext o Embed mbuf_ext_pgs o Start documenting all this entanglement Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
|
#
b7eae758 |
|
28-Apr-2020 |
Mark Johnston <markj@FreeBSD.org> |
Make sendfile(SF_SYNC)'s CV wait interruptible. Otherwise, since the CV is not signalled until data is drained from the socket, it is trivial to create an unkillable process using sendfile(SF_SYNC) and a process-private PF_LOCAL socket pair. In particular, the cv_wait() in sendfile() does not get interrupted until data is drained from the receiving socket buffer. Reported by: pho Discussed with: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
|
#
acf65eef |
|
17-Apr-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
sendfile: When all io finished, assert that sfio->pa[] is in expected state. It must contain fully restored contigous run of the wired pages from the object, except possible trimmed tail. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
#
e795a040 |
|
17-Apr-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
The pa argument for sendfile_iodone() is not necessary a slice of sfio->pa. It is true for zfs, but it is not for e.g. vnode or buffer pagers. When fixing bogus pages, fix them in both places. Rely on the fact that pa[0] must have been invalid so it cannot be bogus. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
#
23feb563 |
|
14-Apr-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone, This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213
|
#
d25f1b21 |
|
11-Apr-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
sendfile_iodone: correct calculation of the page index for relookup. This is yet another bug in r359473. Reported and tested by: delphij Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
#
f709eee6 |
|
09-Apr-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
Do not pass bogus page to mbufs. This is a bug in r359473. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
#
c506a638 |
|
30-Mar-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
kern_sendfile.c: fix bugs with handling of busy page states. - Do not call into a vnode pager while leaving some pages from the same block as the current run, xbusy. This immediately deadlocks if pager needs to instantiate the buffer. - Only relookup bogus pages after io finished, otherwise we might obliterate the valid pages by out of date disk content. While there, expand the comment explaining this pecularity. - Do not double-unbusy on error. Split unbusy for error case, which is left in the sendfile_swapin(), from the more properly coded normal case in sendfile_iodone(). - Add an XXXKIB comment explaining the serious bug in the validation algorithm, not fixed by this patch series. PR: 244713 Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038
|
#
d8663536 |
|
30-Mar-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
kern_sendfile.c: do not release sfio reference on error. It is already done by sendfile_iodone(), now consistently for all errors. This de-facto reverts r358597, after r359466. Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038
|
#
59e1ac9d |
|
30-Mar-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
kern_sendfile.c: wait for all in-flight ios completion before unwiring pages. Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038
|
#
8f0a223c |
|
30-Mar-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
kern_sendfile.c: add specific malloc type. Now sfio leaks are more easily seen in the malloc statistics than e.g. just wired or busy pages leak. Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038
|
#
0ac8511a |
|
30-Mar-2020 |
Konstantin Belousov <kib@FreeBSD.org> |
kern_sendfile.c style: order headers alphabetically. Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038
|
#
db4493f7 |
|
13-Mar-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
sendfile() does currently not support SCTP sockets. Therefore, fail the call. Reviewed by: markj@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D24059
|
#
37bf88e7 |
|
03-Mar-2020 |
Chuck Silvers <chs@FreeBSD.org> |
if vm_pager_get_pages_async() returns an error, release the sfio->nios refcount that we took earlier that represents the I/O that ended up not being started. Reviewed by: glebius Approved by: imp (mentor) Sponsored by: Netflix
|
#
6be21eb7 |
|
28-Feb-2020 |
Jeff Roberson <jeff@FreeBSD.org> |
Provide a lock free alternative to resolve bogus pages. This is not likely to be much of a perf win, just a nice code simplification. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23866
|
#
7aaf252c |
|
28-Feb-2020 |
Jeff Roberson <jeff@FreeBSD.org> |
Convert a few triviail consumers to the new unlocked grab API. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23847
|
#
7029da5c |
|
26-Feb-2020 |
Pawel Biernacki <kaktus@FreeBSD.org> |
Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
|
#
6bc27f08 |
|
25-Feb-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Generalize resources freeing in sendfile with different scenarios. Now we execute sendfile_iodone() in all possible cases, which guarantees that vm_object_pip_wakeup() is called and sfio structure is freed. At the beginning of sendfile initialize sfio->m to NULL, that would indicate that the mbuf chain either doesn't exist, or belongs to the syscall (not to I/O completion). Fill sfio->m only at a point when we are positive that there are I/Os ongoing and before releasing syscall's reference on sfio. In sendfile_iodone() perform vm_object_pip_wakeup() once last reference is released, then check for sfio->m. NULL pointer indicates that we need only to free the memory. Reviewed by: jtl, gallatin
|
#
f85e1a80 |
|
25-Feb-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make ktls_frame() never fail. Caller must supply correct mbufs. This makes sendfile code a bit simplier.
|
#
69302907 |
|
25-Feb-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
When sendfile_swapin() sweeps through pages in search for a bogus page skip first and last pages. This is a micro optimisation.
|
#
d3631aa5 |
|
05-Feb-2020 |
Mark Johnston <markj@FreeBSD.org> |
Avoid releasing object PIP in vn_sendfile() if no pages were grabbed. sendfile(2) optionally takes a set of headers that get prepended to the file data. If the request length is less than that of the headers, sendfile may not allocate an sfio structure, in which case its pointer is null and we should be careful not to dereference. This was introduced in r356902. Reported by: syzkaller Sponsored by: The FreeBSD Foundation
|
#
91e31c3c |
|
22-Jan-2020 |
Jeff Roberson <jeff@FreeBSD.org> |
Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283
|
#
d6e13f3b |
|
19-Jan-2020 |
Jeff Roberson <jeff@FreeBSD.org> |
Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033
|
#
b249ce48 |
|
03-Jan-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427
|
#
b631c36f |
|
24-Nov-2019 |
Konstantin Belousov <kib@FreeBSD.org> |
Record part of the owner struct thread pointer into busy_lock. Record as much bits from curthread into busy_lock as fits. Low bits for struct thread * representation are zero due to struct and zone alignment, and they leave space for busy flags (perhaps except statically allocated thread0). Upper bits are not very interesting for assert, and in most practical situations recorded value should allow to manually identify the owner with certainity. Assert that unbusy is performed by the owner, except few places where unbusy is done in io completion handler. For this case, add _unchecked variants of asserts and unbusy primitives. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22298
|
#
b8c92303 |
|
06-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
If vm_pager_get_pages_async() returns an error synchronously we leak wired and busy pages. Add code that would carefully cleanups the state in case of synchronous error return. Cover a case when a first I/O went on asynchronously, but second or N-th returned error synchronously. In collaboration with: chs Reviewed by: jtl, kib
|
#
9e14430d |
|
08-Oct-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891
|
#
fee2a2fa |
|
09-Sep-2019 |
Mark Johnston <markj@FreeBSD.org> |
Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486
|
#
818d7553 |
|
27-Aug-2019 |
John Baldwin <jhb@FreeBSD.org> |
Only define the 'tls' member of sfio in KERN_TLS is defined. This field was not initialized in the !KERN_TLS case triggering an assertion failure when using sendfile(2). Reported by: pho, asomers Sponsored by: Netflix
|
#
b2e60773 |
|
26-Aug-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
|
#
814f33aa |
|
06-Aug-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Since r350426 this KASSERT doesn't serve any useful purpose.
|
#
98549e2d |
|
29-Jul-2019 |
Mark Johnston <markj@FreeBSD.org> |
Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986
|
#
fca79580 |
|
19-Jul-2019 |
Alan Somers <asomers@FreeBSD.org> |
sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error PR: 236466 Sponsored by: The FreeBSD Foundation
|
#
cec06a3e |
|
28-Jun-2019 |
John Baldwin <jhb@FreeBSD.org> |
Add support for using unmapped mbufs with sendfile(2). This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl. It is disabled by default. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616
|
#
19c40c76 |
|
11-Jun-2019 |
John Baldwin <jhb@FreeBSD.org> |
Trim an extra space.
|
#
47f1ea51 |
|
16-Oct-2018 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Plug sendfile(2) on a listening socket with proper error code. Reported by: ngie Reviewed by: ngie Approved by: re (delphij)
|
#
147cd40f |
|
29-May-2018 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Revert second chunk of r333860. The warning from gcc is false positive. The npages won't be ever used in no space case.
|
#
7fd68414 |
|
18-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
sendfile: annotate unused value and ensure that npages is actually initialized
|
#
cbd92ce6 |
|
09-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month
|
#
1b5c869d |
|
04-May-2018 |
Mark Johnston <markj@FreeBSD.org> |
Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280
|
#
6469bdcd |
|
06-Apr-2018 |
Brooks Davis <brooks@FreeBSD.org> |
Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
|
#
0eb50f9c |
|
18-Mar-2018 |
Mark Johnston <markj@FreeBSD.org> |
Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625
|
#
1d3a1bcf |
|
07-Feb-2018 |
Mark Johnston <markj@FreeBSD.org> |
Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943
|
#
b4f55763 |
|
05-Jan-2018 |
Gleb Smirnoff <glebius@FreeBSD.org> |
In sendfile_iodone() both pru_abort and sorele need to be executed with proper VNET context set. Reported by: sbruno MFC after: 2 weeks
|
#
41bf90bb |
|
13-Oct-2017 |
Alan Cox <alc@FreeBSD.org> |
Address two problems with sendfile(..., SF_NOCACHE) and apply one "optimization". First, sendfile(..., SF_NOCACHE) frees pages without checking whether those pages are mapped. This can leave the system with mappings to free or repurposed pages. Second, a page can be busied between the time of the current busy test and acquiring the object lock. Essentially, the test performed before the object lock is acquired can only be regarded as an optimization to short-circuit further work on the page. It cannot, however, be relied upon to prove that it is safe to free the page. Third, when sendfile(..., SF_NOCACHE) was originally implemented, vm_page_deactivate_noreuse() did not yet exist. Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(), because it comes closer to freeing the page. In collaboration with: glebius Discussed with: gallatin, kib, markj X-MFC after: r324448
|
#
1f9916ed |
|
10-Oct-2017 |
Sean Bruno <sbruno@FreeBSD.org> |
match sendfile() error handling to send(). Sendfile() should match the error checking order of send() which is currently: SBS_CANTSENDMORE so_error SS_ISCONNECTED Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: glebius MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12633
|
#
009ad572 |
|
09-Oct-2017 |
Sean Bruno <sbruno@FreeBSD.org> |
Revert r324405 at the request of the submitter pending better solution. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks
|
#
9c82bec4 |
|
09-Oct-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Improvements to sendfile(2) mbuf free routine. o Fall back to default m_ext free mech, using function pointer in m_ext_free, and remove sf_ext_free() called directly from mbuf code. Testing on modern CPUs showed no regression. o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses SF_SYNC flag. Lack of the flag allows us not to dereference ext_arg2, saving from a cache line miss. o Create function sendfile_free_page() that later will be used, for multi-page mbufs. For now compiler will inline it into sendfile_free_mext(). In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615
|
#
75c8dfb6 |
|
07-Oct-2017 |
Sean Bruno <sbruno@FreeBSD.org> |
Check so_error early in sendfile() call. Prior to this patch, if a connection was reset by the remote end, sendfile() would just report ENOTCONN instead of ECONNRESET. Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: glebius Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12575
|
#
d37aa3cc |
|
13-Sep-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Use soref() in sendfile(2) instead fhold() to reference a socket. The problem is that fdrop() requires syscall context, as it may enter sleep in some cases. The reason to use it in the original non-blocking sendfile implementation, was to avoid use of global ACCEPT_LOCK() on every I/O completion. Now in head sorele() no longer requires this lock.
|
#
af0460be |
|
11-Aug-2017 |
Mark Johnston <markj@FreeBSD.org> |
Have sendfile_swapin() use vm_page_grab_pages(). Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11942
|
#
6921451d |
|
11-Aug-2017 |
Alan Cox <alc@FreeBSD.org> |
An invalid page can't be dirty. Reviewed by: kib MFC after: 1 week
|
#
ef3266d5 |
|
09-Aug-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Plug uninitialized stack variable leak in sendfile(2). Reported by: Ilja Van Sprundel <ivansprundel ioactive.com> Submitted by: Domagoj Stolfa <domagoj.stolfa gmail.com> MFC after: 1 week Security: uninitialized stack variable leak
|
#
d712b799 |
|
03-Jun-2017 |
Alan Cox <alc@FreeBSD.org> |
The data type returned by vmoff() is too narrow in its range. This could break the transmission of files longer than 4 GB on 32-bit architectures. Reviewed by: glebius, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10019
|
#
9e3c8bd3 |
|
24-Mar-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make sendfile(2) more robust against file change. This fixes a possible crash when the file shrinks. This also fixes sendfile(2) not sending more data in a case when the file grows, and the request is open-ended or specifies a size that is greater than old file size. PR: 217789 Reviewed by: gallatin MFC after: 10 days
|
#
bfc8c24c |
|
04-Jan-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib
|
#
00b5ffde |
|
17-Nov-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't do any speculations about readahead, and use exactly the amount of readahead specified by user. E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee that no readahead at all will be performed.
|
#
5dba303d |
|
17-Nov-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Use bogus_page to properly reduce number of I/Os in sendfile(2). The new sendfile_swapin() loop works this way: - Find first invalid page in the request. - Do vm_pager_has_page() and get count of pages, that can be taken in single I/O. - Trim valid pages from the end of the request. - Cycle through the request and substitute to bogus_page all valid pages that are in the middle of the request. - After I/O launched (pager copies array of pages into buf(9), it is important to restore proper page pointers with help vm_page_lookup(). Count bogus pages used and report them in sendfile stats.
|
#
a2d8f9d2 |
|
22-Sep-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Fix regression from r297400, which truncates headers in case of low socket buffer and put a small optimization for low socket buffer case: - Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This fixes truncation of headers at low buffer. - If headers ate all the space, jump right to the end of the cycle, to avoid doing single page I/O and allocating zero length mbuf. - Clear hdr_uio only if space is positive, which indicates that all uio was copied in. Reviewed by: pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl
|
#
85b0f9de |
|
22-Sep-2016 |
Mariusz Zaborski <oshogbo@FreeBSD.org> |
capsicum: propagate rights on accept(2) Descriptor returned by accept(2) should inherits capabilities rights from the listening socket. PR: 201052 Reviewed by: emaste, jonathan Discussed with: many Differential Revision: https://reviews.freebsd.org/D7724
|
#
69a28758 |
|
15-Sep-2016 |
Ed Maste <emaste@FreeBSD.org> |
Renumber license clauses in sys/kern to avoid skipping #3
|
#
9c64cfe5 |
|
29-Mar-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
The sendfile(2) allows to send extra data from userspace before the file data (headers). Historically the size of the headers was not checked against the socket buffer space. Application could easily overcommit the socket buffer space. With the new sendfile (r293439) the problem remained, but a KASSERT was inserted that checked that amount of data written to the socket matches its space. In case when size of headers is bigger that socket space, KASSERT fires. Without INVARIANTS the new sendfile won't panic, but would report incorrect amount of bytes sent. o With this change, the headers copyin is moved down into the cycle, after the sbspace() check. The uio size is trimmed by socket space there, which fixes the overcommit problem and its consequences. o The compatibility handling for FreeBSD 4 sendfile headers API is pushed up the stack to syscall wrappers. This required a copy and paste of the code, but in turn this allowed to remove extra stack carried parameter from fo_sendfile_t, and embrace entire compat code into #ifdef. If in future we got more fo_sendfile_t function, the copy and paste level would even reduce. Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru> Tested by: Vitalij Satanivskij <satan ukr.net> Sponsored by: Netflix
|
#
56a5f52e |
|
29-Feb-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
New way to manage reference counting of mbuf external storage. The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used. The first mbuf to attach a cluster stores the refcount. The further mbufs to reference the cluster point at refcount in the first mbuf. The first mbuf is freed only when the last reference is freed. The benefit over refcounts stored in separate slabs is that now refcounts of different, unrelated mbufs do not share a cache line. For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd() becomes void, making widely used M_EXTADD macro safe. For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization exactly against the cache aliasing problem with regular refcounting. Discussed with: rrs, rwatson, gnn, hiren, sbruno, np Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D5396 Sponsored by: Netflix
|
#
33a2a37b |
|
21-Jan-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
- Separate sendfile(2) implementation from uipc_syscalls.c into separate file. Claim my copyright. - Provide more comments, better function and structure names. - Sort out unneeded includes from resulting two files. No functional changes.
|