History log of /freebsd-current/sys/kern/kern_sendfile.c
Revision Date Author Comments
# 0020e1b6 10-Apr-2024 Gleb Smirnoff <glebius@FreeBSD.org>

Revert "sendfile: mark it explicitly as a TCP only feature"

This reverts commit 3b7aa842e27dcf07181f161b1abde0067ed51e97.


# 3b7aa842 08-Apr-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sendfile: mark it explicitly as a TCP only feature

Back in 2015 when it turned non-blocking, it was working with PF_UNIX
and it may still work. However, the usefullness of such application
of sendfile(2) is questionable. Disable the feature while unix/stream
is under refactoring.

Relnotes: yes


# 61cc4830 18-Jan-2024 Alfredo Mazzinghi <am2419@cl.cam.ac.uk>

Abstract UIO allocation and deallocation.

Introduce the allocuio() and freeuio() functions to allocate and
deallocate struct uio. This hides the actual allocator interface, so it
is easier to modify the sub-allocation layout of struct uio and the
corresponding iovec array.

Obtained from: CheriBSD
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: CHaOS, EPSRC grant EP/V000292/1
Differential Revision: https://reviews.freebsd.org/D43711


# d0adc2f2 25-Dec-2023 Mark Johnston <markj@FreeBSD.org>

sendfile: Explicitly ignore errors from copyout()

There is a documented bug in sendfile.2 which notes that sendfile(2)
does not raise an error if it fails to copy out the number of bytes
written. Explicitly ignore the error from copyout() calls in
preparation for annotating copyout() with __result_use_check.

Reviewed by: glebius, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43129


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# f45feecf 22-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_getsize

getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885


# 3212ad15 07-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

Add getsock

All but one consumers of getsock_cap only pass 4 arguments.
Take advantage of it.


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 43283184 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use socket buffer mutexes in struct socket directly

Since c67f3b8b78e the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b50 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152


# fe27f1db 25-Dec-2021 Alexander Motin <mav@FreeBSD.org>

kern: Remove CTLFLAG_NEEDGIANT from some sysctls.

MFC after: 2 weeks


# e3ba94d4 09-Nov-2021 John Baldwin <jhb@FreeBSD.org>

Don't require the socket lock for sorele().

Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741


# 523d58aa 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Remove unneeded SOLISTENING checks

Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks. No functional change intended.

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31660


# f94acf52 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Rename sb(un)lock() and interlock with listen(2)

In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().

Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657


# 916c61a5 21-May-2021 Mark Johnston <markj@FreeBSD.org>

Fix handling of errors from pru_send(PRUS_NOTREADY)

PRUS_NOTREADY indicates that the caller has not yet populated the chain
with data, and so it is not ready for transmission. This is used by
sendfile (for async I/O) and KTLS (for encryption). In particular, if
pru_send returns an error, the caller is responsible for freeing the
chain since other implicit references to the data buffers exist.

For async sendfile, it happens that an error will only be returned if
the connection was dropped, in which case tcp_usr_ready() will handle
freeing the chain. But since KTLS can be used in conjunction with the
regular socket I/O system calls, many more error cases - which do not
result in the connection being dropped - are reachable. In these cases,
KTLS was effectively assuming success.

So:
- Change sosend_generic() to free the mbuf chain if
pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the
chain at that point.
- Similarly, in vn_sendfile() change the !async I/O && KTLS case to free
the chain.
- If async I/O is still outstanding when pru_send fails in
vn_sendfile(), set an error in the sfio structure so that the
connection is aborted and the mbuf chain is freed.

Reviewed by: gallatin, tuexen
Discussed with: jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30349


# 52a99c72 02-Apr-2021 Mark Johnston <markj@FreeBSD.org>

sendfile: Fix error initialization in sendfile_getobj()

Reviewed by: chs, kib
Reported by: jhb
Fixes: faa998f6ff695
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D29540


# faa998f6 25-Feb-2021 Mark Johnston <markj@FreeBSD.org>

sendfile: Use the pager size to determine the file extent when possible

Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size. When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file. An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager. In other words, both the file size and pager
size were used interchangeably.

With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES. In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache. The on-disk copy is unaffected.

Fix this by consistently using the pager size when available.

Reported by: dumbbell
Reviewed by: chs, kib
Tested by: dumbbell, pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28811


# 214257da 03-Jan-2021 Mark Johnston <markj@FreeBSD.org>

sendfile: Clear page pointers when handling a pager error

When INVARIANTS is configred, the sendfile_iodone() callback verifies
that pages attached to the sendfile header are wired, but we unwire all
such pages after a synchronous pager error, before calling
sendfile_iodone().

Reported by: pho
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 26b23f07 26-Dec-2020 Mark Johnston <markj@FreeBSD.org>

sendfile: Ensure that sfio->npages is initialized

We initialize sfio->npages only when some I/O is required to satisfy the
request. However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.

Fix the problem by initializing sfio->npages earlier. Note that
sendfile_swapin() always initializes the page array. In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.

Reported by: syzkaller (with KASAN)
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27726


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# c2ea3d44 05-Jun-2020 Chuck Silvers <chs@FreeBSD.org>

Fix hang due to missing unbusy in sendfile when an async data I/O fails.

r359473 removed the page unbusy logic from sendfile_iodone() because when
vm_pager_get_pages_async() would return an error after failing to start
the async I/O (eg. because VOP_BMAP failed), sendfile_swapin() would also
unbusy the pages, and it was wrong to unbusy twice. However this breaks
the case where vm_pager_get_pages_async() succeeds in starting an async I/O
and the async I/O is what fails. In this case, sendfile_iodone() must
unbusy the pages, and because sendfile_iodone() doesn't know which case
it is in, sendfile_iodone() must always unbusy pages and relookup pages
which have been substituted with bogus_page, which in turn means that
sendfile_swapin() must never do unbusy or relookup for pages which have
been given to vm_pager_get_pages_async(), even if there is an error.

Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25136


# 61664ee7 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.2: start divorce of M_EXT and M_EXTPG

They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 6edfd179 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.1: mechanically rename M_NOMAP to M_EXTPG

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# bccf6e26 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 0c103266 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Continuation of multi page mbuf redesign from r359919.

The following series of patches addresses three things:

Now that array of pages is embedded into mbuf, we no longer need
separate structure to pass around, so struct mbuf_ext_pgs is an
artifact of the first implementation. And struct mbuf_ext_pgs_data
is a crutch to accomodate the main idea r359919 with minimal churn.

Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP.

The namespace for the newfeature is somewhat inconsistent and
sometimes has a lengthy prefixes. In these patches we will
gradually bring the namespace to "m_epg" prefix for all mbuf
fields and most functions.

Step 1 of 4:

o Anonymize mbuf_ext_pgs_data, embed in m_ext
o Embed mbuf_ext_pgs
o Start documenting all this entanglement

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# b7eae758 28-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Make sendfile(SF_SYNC)'s CV wait interruptible.

Otherwise, since the CV is not signalled until data is drained from the
socket, it is trivial to create an unkillable process using
sendfile(SF_SYNC) and a process-private PF_LOCAL socket pair. In
particular, the cv_wait() in sendfile() does not get interrupted until
data is drained from the receiving socket buffer.

Reported by: pho
Discussed with: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# acf65eef 17-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

sendfile: When all io finished, assert that sfio->pa[] is in expected state.

It must contain fully restored contigous run of the wired pages from
the object, except possible trimmed tail.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e795a040 17-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

The pa argument for sendfile_iodone() is not necessary a slice of sfio->pa.

It is true for zfs, but it is not for e.g. vnode or buffer pagers.
When fixing bogus pages, fix them in both places. Rely on the fact
that pa[0] must have been invalid so it cannot be bogus.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# d25f1b21 11-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

sendfile_iodone: correct calculation of the page index for relookup.

This is yet another bug in r359473.

Reported and tested by: delphij
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# f709eee6 09-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not pass bogus page to mbufs.

This is a bug in r359473.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# c506a638 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

kern_sendfile.c: fix bugs with handling of busy page states.

- Do not call into a vnode pager while leaving some pages from the
same block as the current run, xbusy. This immediately deadlocks if
pager needs to instantiate the buffer.
- Only relookup bogus pages after io finished, otherwise we might
obliterate the valid pages by out of date disk content. While there,
expand the comment explaining this pecularity.
- Do not double-unbusy on error. Split unbusy for error case, which
is left in the sendfile_swapin(), from the more properly coded
normal case in sendfile_iodone().
- Add an XXXKIB comment explaining the serious bug in the validation
algorithm, not fixed by this patch series.

PR: 244713
Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# d8663536 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

kern_sendfile.c: do not release sfio reference on error.

It is already done by sendfile_iodone(), now consistently for all errors.
This de-facto reverts r358597, after r359466.

Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# 59e1ac9d 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

kern_sendfile.c: wait for all in-flight ios completion before unwiring pages.

Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# 8f0a223c 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

kern_sendfile.c: add specific malloc type.

Now sfio leaks are more easily seen in the malloc statistics than
e.g. just wired or busy pages leak.

Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# 0ac8511a 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

kern_sendfile.c style: order headers alphabetically.

Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# db4493f7 13-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

sendfile() does currently not support SCTP sockets.
Therefore, fail the call.

Reviewed by: markj@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24059


# 37bf88e7 03-Mar-2020 Chuck Silvers <chs@FreeBSD.org>

if vm_pager_get_pages_async() returns an error, release the sfio->nios
refcount that we took earlier that represents the I/O that ended up
not being started.

Reviewed by: glebius
Approved by: imp (mentor)
Sponsored by: Netflix


# 6be21eb7 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23866


# 7aaf252c 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Convert a few triviail consumers to the new unlocked grab API.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23847


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 6bc27f08 25-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Generalize resources freeing in sendfile with different scenarios.
Now we execute sendfile_iodone() in all possible cases, which
guarantees that vm_object_pip_wakeup() is called and sfio structure
is freed.

At the beginning of sendfile initialize sfio->m to NULL, that would
indicate that the mbuf chain either doesn't exist, or belongs to the
syscall (not to I/O completion). Fill sfio->m only at a point when
we are positive that there are I/Os ongoing and before releasing
syscall's reference on sfio.

In sendfile_iodone() perform vm_object_pip_wakeup() once last
reference is released, then check for sfio->m. NULL pointer
indicates that we need only to free the memory.

Reviewed by: jtl, gallatin


# f85e1a80 25-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ktls_frame() never fail. Caller must supply correct mbufs.
This makes sendfile code a bit simplier.


# 69302907 25-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

When sendfile_swapin() sweeps through pages in search for a bogus page
skip first and last pages. This is a micro optimisation.


# d3631aa5 05-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Avoid releasing object PIP in vn_sendfile() if no pages were grabbed.

sendfile(2) optionally takes a set of headers that get prepended to the
file data. If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference. This was
introduced in r356902.

Reported by: syzkaller
Sponsored by: The FreeBSD Foundation


# 91e31c3c 22-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Consistently use busy and vm_page_valid() rather than touching page bits
directly. This improves API compliance, asserts, etc.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23283


# d6e13f3b 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't hold the object lock while calling getpages.

The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# b631c36f 24-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Record part of the owner struct thread pointer into busy_lock.

Record as much bits from curthread into busy_lock as fits. Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0). Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler. For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22298


# b8c92303 06-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

If vm_pager_get_pages_async() returns an error synchronously we leak wired
and busy pages. Add code that would carefully cleanups the state in case
of synchronous error return. Cover a case when a first I/O went on
asynchronously, but second or N-th returned error synchronously.

In collaboration with: chs
Reviewed by: jtl, kib


# 9e14430d 08-Oct-2019 John Baldwin <jhb@FreeBSD.org>

Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.

This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchonization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486


# 818d7553 27-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Only define the 'tls' member of sfio in KERN_TLS is defined.

This field was not initialized in the !KERN_TLS case triggering an
assertion failure when using sendfile(2).

Reported by: pho, asomers
Sponsored by: Netflix


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 814f33aa 06-Aug-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Since r350426 this KASSERT doesn't serve any useful purpose.


# 98549e2d 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().

Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986


# fca79580 19-Jul-2019 Alan Somers <asomers@FreeBSD.org>

sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error

PR: 236466
Sponsored by: The FreeBSD Foundation


# cec06a3e 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add support for using unmapped mbufs with sendfile(2).

This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 19c40c76 11-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Trim an extra space.


# 47f1ea51 16-Oct-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Plug sendfile(2) on a listening socket with proper error code.

Reported by: ngie
Reviewed by: ngie
Approved by: re (delphij)


# 147cd40f 29-May-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Revert second chunk of r333860. The warning from gcc is false positive. The
npages won't be ever used in no space case.


# 7fd68414 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

sendfile: annotate unused value and ensure that npages is actually initialized


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 0eb50f9c 18-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Have vm_page_{deactivate,launder}() requeue already-queued pages.

In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# b4f55763 05-Jan-2018 Gleb Smirnoff <glebius@FreeBSD.org>

In sendfile_iodone() both pru_abort and sorele need to be executed
with proper VNET context set.

Reported by: sbruno
MFC after: 2 weeks


# 41bf90bb 13-Oct-2017 Alan Cox <alc@FreeBSD.org>

Address two problems with sendfile(..., SF_NOCACHE) and apply one
"optimization". First, sendfile(..., SF_NOCACHE) frees pages without
checking whether those pages are mapped. This can leave the system
with mappings to free or repurposed pages. Second, a page can be
busied between the time of the current busy test and acquiring the
object lock. Essentially, the test performed before the object lock
is acquired can only be regarded as an optimization to short-circuit
further work on the page. It cannot, however, be relied upon to prove
that it is safe to free the page. Third, when sendfile(..., SF_NOCACHE)
was originally implemented, vm_page_deactivate_noreuse() did not yet
exist. Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(),
because it comes closer to freeing the page.

In collaboration with: glebius
Discussed with: gallatin, kib, markj
X-MFC after: r324448


# 1f9916ed 10-Oct-2017 Sean Bruno <sbruno@FreeBSD.org>

match sendfile() error handling to send().

Sendfile() should match the error checking order of send() which
is currently:

SBS_CANTSENDMORE
so_error
SS_ISCONNECTED

Submitted by: Jason Eggleston <jason@eggnet.com>
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12633


# 009ad572 09-Oct-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r324405 at the request of the submitter pending better solution.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks


# 9c82bec4 09-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Improvements to sendfile(2) mbuf free routine.

o Fall back to default m_ext free mech, using function pointer in
m_ext_free, and remove sf_ext_free() called directly from mbuf code.
Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
SF_SYNC flag. Lack of the flag allows us not to dereference
ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
multi-page mbufs. For now compiler will inline it into
sendfile_free_mext().

In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615


# 75c8dfb6 07-Oct-2017 Sean Bruno <sbruno@FreeBSD.org>

Check so_error early in sendfile() call. Prior to this patch, if a
connection was reset by the remote end, sendfile() would just report
ENOTCONN instead of ECONNRESET.

Submitted by: Jason Eggleston <jason@eggnet.com>
Reviewed by: glebius
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12575


# d37aa3cc 13-Sep-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Use soref() in sendfile(2) instead fhold() to reference a socket.

The problem is that fdrop() requires syscall context, as it may
enter sleep in some cases. The reason to use it in the original
non-blocking sendfile implementation, was to avoid use of global
ACCEPT_LOCK() on every I/O completion. Now in head sorele() no
longer requires this lock.


# af0460be 11-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Have sendfile_swapin() use vm_page_grab_pages().

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11942


# 6921451d 11-Aug-2017 Alan Cox <alc@FreeBSD.org>

An invalid page can't be dirty.

Reviewed by: kib
MFC after: 1 week


# ef3266d5 09-Aug-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Plug uninitialized stack variable leak in sendfile(2).

Reported by: Ilja Van Sprundel <ivansprundel ioactive.com>
Submitted by: Domagoj Stolfa <domagoj.stolfa gmail.com>
MFC after: 1 week
Security: uninitialized stack variable leak


# d712b799 03-Jun-2017 Alan Cox <alc@FreeBSD.org>

The data type returned by vmoff() is too narrow in its range. This could
break the transmission of files longer than 4 GB on 32-bit architectures.

Reviewed by: glebius, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10019


# 9e3c8bd3 24-Mar-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Make sendfile(2) more robust against file change. This fixes a possible
crash when the file shrinks. This also fixes sendfile(2) not sending more
data in a case when the file grows, and the request is open-ended or
specifies a size that is greater than old file size.

PR: 217789
Reviewed by: gallatin
MFC after: 10 days


# bfc8c24c 04-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Move bogus_page declaration to vm_page.h and initialization to vm_page.c.

Reviewed by: kib


# 00b5ffde 17-Nov-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't
do any speculations about readahead, and use exactly the amount of readahead
specified by user. E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.


# 5dba303d 17-Nov-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Use bogus_page to properly reduce number of I/Os in sendfile(2). The new
sendfile_swapin() loop works this way:

- Find first invalid page in the request.
- Do vm_pager_has_page() and get count of pages, that can be taken in
single I/O.
- Trim valid pages from the end of the request.
- Cycle through the request and substitute to bogus_page all valid
pages that are in the middle of the request.
- After I/O launched (pager copies array of pages into buf(9), it
is important to restore proper page pointers with help vm_page_lookup().

Count bogus pages used and report them in sendfile stats.


# a2d8f9d2 22-Sep-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression from r297400, which truncates headers in case of low socket
buffer and put a small optimization for low socket buffer case:

- Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This
fixes truncation of headers at low buffer.
- If headers ate all the space, jump right to the end of the cycle, to
avoid doing single page I/O and allocating zero length mbuf.
- Clear hdr_uio only if space is positive, which indicates that all uio
was copied in.

Reviewed by: pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl


# 85b0f9de 22-Sep-2016 Mariusz Zaborski <oshogbo@FreeBSD.org>

capsicum: propagate rights on accept(2)

Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# 9c64cfe5 29-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

The sendfile(2) allows to send extra data from userspace before the file
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.

Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix


# 56a5f52e 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

New way to manage reference counting of mbuf external storage.

The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with: rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D5396
Sponsored by: Netflix


# 33a2a37b 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Separate sendfile(2) implementation from uipc_syscalls.c into
separate file. Claim my copyright.
- Provide more comments, better function and structure names.
- Sort out unneeded includes from resulting two files.

No functional changes.