History log of /freebsd-current/sys/kern/uipc_ktls.c
Revision Date Author Comments
# b5a9299b 17-Mar-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

ktls: catch invalid parameters earlier

Move safety checks forward from ktls_session_create() to
ktls_copyin_tls_enable(). Prevents zero mallocs, and excessively
large kernel mallocs.
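
A minimal userland sketch of the idea, not the kernel code: lengths are validated before any allocation, so neither a zero-length nor an oversized request reaches the allocator. The structure, the function name, the SKETCH_MAX_KEY_LEN bound, and the EINVAL return value are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical upper bound; the real limits live in the kernel source. */
#define SKETCH_MAX_KEY_LEN 64

/* Hypothetical stand-in for the caller-supplied parameters. */
struct sketch_tls_params {
	const void *key;
	size_t key_len;
};

/*
 * Copy the key only after its length has been checked.
 * Returns 0 and stores the copy in *out, or an errno value with *out
 * left NULL. Invalid lengths are rejected before malloc() is called.
 */
static int
sketch_copy_key(const struct sketch_tls_params *p, void **out)
{
	void *copy;

	assert(p != NULL && out != NULL);
	*out = NULL;
	if (p->key_len == 0 || p->key_len > SKETCH_MAX_KEY_LEN)
		return (EINVAL);
	copy = malloc(p->key_len);
	if (copy == NULL)
		return (ENOMEM);
	memcpy(copy, p->key, p->key_len);
	*out = copy;
	return (0);
}
```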

Reported-by: syzbot+72022fa9163fa958b66c@syzkaller.appspotmail.com
Reported-by: syzbot+8992893e13058ce0670a@syzkaller.appspotmail.com
Sponsored by: NetApp, Inc.
X-NetApp-PR: #79
Reviewed By: tuexen
Differential Revision: https://reviews.freebsd.org/D44364


# 85df11a1 12-Mar-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

ktls: deep copy tls_enable struct for in-kernel tcp consumers

Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.

Reviewed By: #transport, tuexen, jhb, rrs
Sponsored by: NetApp, Inc.
X-NetApp-PR: #79
Differential Revision: https://reviews.freebsd.org/D44250


# 0e1d8481 11-Jan-2024 Martin Matuska <mm@FreeBSD.org>

ktls: fix vnet-related panic in ktls_reset_receive_tag()

Reviewed by: gallatin, jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43400


# 2619c5cc 20-Nov-2023 Jason A. Harmening <jah@FreeBSD.org>

Avoid waiting on physical allocations that can't possibly be satisfied

- Change vm_page_reclaim_contig[_domain] to return an errno instead
of a boolean. 0 indicates a successful reclaim, ENOMEM indicates
lack of available memory to reclaim, with any other error (currently
only ERANGE) indicating that reclamation is impossible for the
specified address range. Change all callers to only follow
up with vm_page_wait* in the ENOMEM case.

- Introduce vm_domainset_iter_ignore(), which marks the specified
domain as unavailable for further use by the iterator. Use this
function to ignore domains that can't possibly satisfy a physical
allocation request. Since WAITOK allocations run the iterators
repeatedly, this avoids the possibility of infinitely spinning
in domain iteration if no available domain can satisfy the
allocation request.
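
The caller-side decision described in the first item can be sketched as follows. This is an illustration only: the enum and function names are hypothetical, and the real callers invoke vm_page_wait*() and retry rather than returning an action code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical action codes for what a caller does after a reclaim attempt. */
enum sketch_action {
	SKETCH_RETRY_ALLOC,	/* reclaim succeeded, try the allocation again */
	SKETCH_WAIT_THEN_RETRY,	/* ENOMEM: wait for free pages, then retry */
	SKETCH_GIVE_UP		/* e.g. ERANGE: the range can never be satisfied */
};

/* Map the errno-style result of a reclaim attempt to the next step. */
static enum sketch_action
sketch_after_reclaim(int error)
{
	assert(error >= 0);
	switch (error) {
	case 0:
		return (SKETCH_RETRY_ALLOC);
	case ENOMEM:
		return (SKETCH_WAIT_THEN_RETRY);
	default:
		return (SKETCH_GIVE_UP);
	}
}
```

With a boolean return, the last two cases were indistinguishable, which is what allowed a caller to wait on an allocation that could never succeed.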

PR: 274252
Reported by: kevans
Tested by: kevans
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42706


# 1f8a5187 09-Nov-2023 Alexander Motin <mav@FreeBSD.org>

ktls: Remove unneeded vm/uma_dbg.h include

It was used in the original implementation, but is no longer.

MFC after: 2 weeks


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# c721694a 19-Jul-2023 Navdeep Parhar <np@FreeBSD.org>

ktls_alloc_rcv_tag: Fix capability checks for RXTLS4/6.

IFCAP2_* has the bit position and not the shifted value.
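
A small sketch of this class of bug, using a hypothetical constant rather than the real IFCAP2_* values: when a constant holds a bit position, it has to be shifted into a mask before being tested against a capability word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability expressed as a bit position, not a mask. */
#define SKETCH_CAP2_RXTLS4_BIT	3
#define SKETCH_BIT(n)		(UINT32_C(1) << (n))

/* Buggy check: uses the bit position (3, binary 011) as if it were a mask. */
static bool
sketch_has_rxtls4_wrong(uint32_t caps)
{
	return ((caps & SKETCH_CAP2_RXTLS4_BIT) != 0);
}

/* Correct check: converts the position into a mask first. */
static bool
sketch_has_rxtls4(uint32_t caps)
{
	assert(SKETCH_CAP2_RXTLS4_BIT < 32);
	return ((caps & SKETCH_BIT(SKETCH_CAP2_RXTLS4_BIT)) != 0);
}
```

In this sketch the buggy variant both misses the capability when only bit 3 is set and reports it when an unrelated low bit is set.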

Reviewed by: kib@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41100


# 743516d5 17-May-2023 Gleb Smirnoff <glebius@FreeBSD.org>

ktls: don't try to unlock pcb if tcp_drop() already did

Reviewed by: rrs, gallatin


# 19855852 08-May-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: re-work alloc thread

When the ktls_buffer zone needs to expand, it may fail due
to a lack of physically contiguous memory. We tried to rectify
that by introducing an alloc thread to provide a context where
it is harmless to sleep, and letting that thread repopulate
the ktls_buffer zone.

However, it turns out that M_WAITOK is not enough, and we
must call vm_page_reclaim_contig_domain() to reclaim contig
memory. Worse, M_WAITOK results in the allocation essentially
busy-looping around vm_domain_alloc_fail() returning EAGAIN,
causing vm_page_alloc_noobj_contig_domain() to loop and resulting
in the alloc thread consuming 100% CPU.

To fix this, we change the alloc thread to call
vm_page_reclaim_contig_domain_ext().

In order to prevent the busy loop around vm_domain_alloc_fail(), we
must change the uma_zalloc flags to M_NORECLAIM | M_NOWAIT. However,
once that is done, these allocations become no different than the
allocations done in the critical path in ktls_buffer_alloc(), so it's
best to just eliminate them.

Since we're no longer doing allocations but just calling
vm_page_reclaim_contig_domain_ext(), the name has changed to the ktls
reclaim thread.

Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D39421


# d2dab20c 23-Mar-2023 John Baldwin <jhb@FreeBSD.org>

ktls: Drop all the INET and INET6 compile-time guards.

Consistent with 9fd0d9b16e93ff2a3bd375a98763dca0150dcee0, KERN_TLS is
not supported on kernels without any INET support.

Reviewed by: gallatin, hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39232


# b4b33821 21-Mar-2023 Mark Johnston <markj@FreeBSD.org>

ktls: Fix interlocking between ktls_enable_rx() and listen(2)

The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is a listening socket and fail if so, but this check is
racy. Since we have to lock the socket buffer later anyway, defer the
check to that point.

ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed. In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.

Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.

Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.

Add some simple regression tests involving listen(2).

Reported by: syzkaller
MFC after: 2 weeks
Reviewed by: gallatin, glebius, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38504


# 08484627 06-Mar-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ktls: Use IfAPI accessors to get capabilities

Summary:
Avoid referencing the ifnet struct directly, and use the IfAPI accessors
instead.

Reviewed by: gallatin
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38932


# d24b032be 09-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Fix comments & whitespace issues with c0e4090e3d43

Address some last minute review feedback on c0e4090e3d43
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.

Pointed out by: jhb
Sponsored by: Netflix


# c0e4090e 08-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLS, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcpcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380


# 846e4a20 18-Jan-2023 John Baldwin <jhb@FreeBSD.org>

ktls_disable_ifnet_help: Set curvnet around sorele().

This is required in kernels with VIMAGE such as GENERIC.

MFC after: 1 week
Sponsored by: Chelsio Communications


# 07be7517 27-Dec-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Post receive errors on partially closed sockets.

If an error such as an invalid record or one whose decryption fails is
detected on a socket that has received a RST then ktls_drop() could
ignore the error since INP_DROPPED could already be set. In this case
soreceive_generic hangs since it does not return from a KTLS socket
with pending encrypted data unless there is an error (so_error) (this
behavior is to ensure that soreceive_generic doesn't return a
premature EOF when there is pending data still being decrypted).

Note that this was a bug prior to
69542f26820b7edb8351398b36edda5299c1db56 as tcp_usr_abort would also
have ignored the error in this case.

Reviewed by: gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37775


# 69542f26 15-Dec-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Close a race with setting so_error when dropping a connection.

pr_abort calls tcp_usr_abort which calls tcp_drop with ECONNABORTED.
After pr_abort returns, the so_error is then set to a more specific
error. However, a reader can observe and return the ECONNABORTED
error before so_error is set to the desired error value. This is
resulting in spurious test failures of recently added tests for
invalid conditions such as invalid headers.

To fix, refactor the code to abort a connection to call tcp_drop
directly with the desired error value. ktls_reset_send_tag already
calls tcp_drop directly when it aborts a connection due to an error.

Reviewed by: gallatin
Reported by: CI (jenkins), gallatin, olivier
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37692


# 9a673b71 15-Nov-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Add software support for AES-CBC decryption for TLS 1.1+.

This is mainly intended to provide a fallback for TOE TLS which may
need to use software decryption for an initial record at the start
of a connection.

Reviewed by: markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37370


# 5920f99d 11-Nov-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Inline ktls_cleanup() into ktls_destroy().

Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37353


# d01db2b8 11-Nov-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Don't leak ktls session objects for certain errors.

ktls_cleanup() does not free ktls session objects, it merely
cleans (and frees) members of the object.

Change callers to use ktls_free() instead.

Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37352


# 8840ae22 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: don't store VNET in every tcpcb, take it from the inpcbinfo

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37125


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of these checks and turn them into
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 0e391a31 05-Sep-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

ktls: Add missing NULL pointer check for TLS RX hardware offload.

The send tag pointer may be NULL when the ktls_reset_receive_tag()
function is invoked. Add check for this.

Reviewed by: gallatin@
Sponsored by: NVIDIA Networking


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# fe8c78f0 23-Apr-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

ktls: Add full support for TLS RX offloading via network interface.

Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags". Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.

During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.

Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.

The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.

Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.

Reviewed by: jhb@ and gallatin@
Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking


# f0fca646 25-May-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

ktls: Refer send tag pointer once.

So that the asserts and the actual code see the same values.

Differential revision: https://reviews.freebsd.org/D32356
MFC after: 1 week
Sponsored by: NVIDIA Networking


# b46667c6 17-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockbuf: merge two versions of sbcreatecontrol() into one

No functional change.


# a4c5d490 22-Apr-2022 John Baldwin <jhb@FreeBSD.org>

KTLS: Move OCF function pointers out of ktls_session.

Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session. This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.

Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35011


# cd0525f6 11-Feb-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Write-lock the INP when changing a transmit TLS session.

The TCP rate pacing code relies on being able to read this pointer
safely while holding an INP lock. The initial TLS session pointer is
set while holding the write lock already.

Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34086


# 5de79eed 07-Feb-2022 Mark Johnston <markj@FreeBSD.org>

ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode

There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen. Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.

Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.

Add regression tests to exercise this case.

Reported by: syzkaller
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34195


# d958bc79 31-Jan-2022 John Baldwin <jhb@FreeBSD.org>

ktls: Try to enable TOE TLS after marking existing data not ready.

At the moment this is mostly a no-op but in the future there will be
in-flight encrypted data which requires software decryption. This
same setup is also needed for NIC TLS RX.

Note that this does break TOE TLS RX for AES-CBC ciphers since there
is no software fallback for AES-CBC receive. This will be resolved
one way or another before 14.0 is released.

Reviewed by: hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34082


# 9e2cce7e 25-Jan-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement a function to get the next TCP- and TLS- receive sequence number.

This function will be used by coming TLS hardware receive offload support.

Differential Revision: https://reviews.freebsd.org/D32356
Discussed with: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking


# 6be8944d 20-Jan-2022 Mark Johnston <markj@FreeBSD.org>

ktls: Zero out TLS_GET_RECORD control messages

Otherwise we end up copying one uninitialized byte into the socket
buffer.

Reported by: KMSAN
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33953


# 05a1d0f5 14-Dec-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Support for TLS 1.3 receive offload.

Note that support for TLS 1.3 receive offload in OpenSSL is still an
open pull request in active development. However, potential changes
to that pull request should not affect the kernel interface.

Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33007


# a90b85dd 14-Dec-2021 Mateusz Guzik <mjg@FreeBSD.org>

ktls: plug set-but-not-used vars

Sponsored by: Rubicon Communications, LLC ("Netgate")


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# de2d4784 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

SMR protection for inpcbs

With the introduction of epoch(9) synchronization to the network stack, the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

A fairly new feature of uma(9), Safe Memory Reclamation, allows memory
to be freed safely in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts a stricter requirement on access to the protected memory,
needing a critical(9) section to access it. Details:

- The database is already built on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some complexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For an inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022


# 900a28fe 15-Nov-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Reject some invalid cipher suites.

- Reject AES-CBC cipher suites for TLS 1.0 and TLS 1.1 using auth
algorithms other than SHA1-HMAC.

- Reject AES-GCM cipher suites for TLS versions older than 1.2.
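
The two rules above can be sketched as a small validation function. The enums, the function name, and the EINVAL return value are hypothetical and chosen for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical identifiers, ordered so versions can be compared. */
enum sketch_tls_version {
	SKETCH_TLS_1_0, SKETCH_TLS_1_1, SKETCH_TLS_1_2, SKETCH_TLS_1_3
};
enum sketch_cipher { SKETCH_AES_CBC, SKETCH_AES_GCM };
enum sketch_auth { SKETCH_AUTH_NONE, SKETCH_SHA1_HMAC, SKETCH_SHA2_256_HMAC };

/* Return 0 if the combination passes both rules, EINVAL otherwise. */
static int
sketch_check_suite(enum sketch_tls_version v, enum sketch_cipher c,
    enum sketch_auth a)
{
	assert(v <= SKETCH_TLS_1_3);
	switch (c) {
	case SKETCH_AES_CBC:
		/* TLS 1.0 and 1.1 with AES-CBC accept only SHA1-HMAC. */
		if (v <= SKETCH_TLS_1_1 && a != SKETCH_SHA1_HMAC)
			return (EINVAL);
		break;
	case SKETCH_AES_GCM:
		/* AES-GCM requires TLS 1.2 or newer. */
		if (v < SKETCH_TLS_1_2)
			return (EINVAL);
		break;
	}
	return (0);
}
```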

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32842


# e3ba94d4 09-Nov-2021 John Baldwin <jhb@FreeBSD.org>

Don't require the socket lock for sorele().

Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() to try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
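
A userland sketch of this fast path using C11 atomics. Everything prefixed sketch_ is hypothetical; the counter increment stands in for taking the socket lock, and the slow path is simplified compared with the kernel's sorele_locked().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket with a reference count. */
struct sketch_sock {
	atomic_uint refs;
	int lock_acquisitions;	/* counts how often the "lock" was taken */
	bool freed;
};

/* Drop one reference unless it is the last one; true if a ref was dropped. */
static bool
sketch_release_if_not_last(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	while (old > 1) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
	}
	return (false);
}

/* Slow path: called with the "lock" held; frees on the last reference. */
static void
sketch_sorele_locked(struct sketch_sock *so)
{
	if (atomic_fetch_sub(&so->refs, 1) == 1)
		so->freed = true;
}

/* Fast path first; only take the "lock" when the last ref may be dropped. */
static void
sketch_sorele(struct sketch_sock *so)
{
	assert(so != NULL);
	if (sketch_release_if_not_last(&so->refs))
		return;
	so->lock_acquisitions++;
	sketch_sorele_locked(so);
}
```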

Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741


# 96668a81 21-Oct-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Always create a software backend for receive sessions.

A future change to TOE TLS will require a software fallback for the
first few TLS records received. Future support for NIC TLS on receive
will also require a software fallback for certain cases.

Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32566


# c57dbec6 21-Oct-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Add a routine to query information in a receive socket buffer.

In particular, ktls_pending_rx_info() determines which TLS record is
at the end of the current receive socket buffer (including
not-yet-decrypted data) along with how much data in that TLS record is
not yet present in the socket buffer.

This is useful for future changes to support NIC TLS receive offload
and enhancements to TOE TLS receive offload. Those use cases need a
way to synchronize a state machine on the NIC with the TLS record
boundaries in the TCP stream.

Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32564


# 84c39222 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert consumers to vm_page_alloc_noobj_contig()

Remove now-unneeded page zeroing. No functional change intended.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32006


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# a72ee355 14-Oct-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Defer creation of threads and zones until first use.

Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot. This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.

Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32487


# 9f03d2c0 13-Oct-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Ensure FIFO encryption order for TLS 1.0.

TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.

If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.

To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.

- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.

If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.

- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
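
The ordering logic in ktls_enqueue() can be sketched roughly as below. This is a simplified illustration with hypothetical types and names: it handles one record per call rather than a batch, and it reports the sequence numbers released for encryption instead of queueing mbufs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record and session types. */
struct sketch_rec {
	uint64_t seqno;
	struct sketch_rec *next;
};

struct sketch_session {
	uint64_t next_seqno;		/* next record allowed to be encrypted */
	struct sketch_rec *pending;	/* out-of-order records, sorted by seqno */
};

/*
 * Enqueue one record. If it is not the next expected record, park it in
 * sorted order and return 0. Otherwise release it, followed by any parked
 * records that are now in order. Released sequence numbers are written to
 * out[] and their count is returned.
 */
static size_t
sketch_enqueue(struct sketch_session *s, struct sketch_rec *r,
    uint64_t *out, size_t out_cap)
{
	struct sketch_rec **pp, *p;
	size_t n = 0;

	assert(s != NULL && r != NULL && out != NULL);
	if (r->seqno != s->next_seqno) {
		for (pp = &s->pending; *pp != NULL && (*pp)->seqno < r->seqno;
		    pp = &(*pp)->next)
			;
		r->next = *pp;
		*pp = r;
		return (0);
	}
	assert(n < out_cap);
	out[n++] = r->seqno;
	s->next_seqno++;
	while (s->pending != NULL && s->pending->seqno == s->next_seqno) {
		p = s->pending;
		s->pending = p->next;
		p->next = NULL;
		assert(n < out_cap);
		out[n++] = p->seqno;
		s->next_seqno++;
	}
	return (n);
}
```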

Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32381


# a63752cc 13-Oct-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Reject attempts to enable AES-CBC with TLS 1.3.

AES-CBC cipher suites are not supported in TLS 1.3.

Reported by: syzbot+ab501c50033ec01d53c6@syzkaller.appspotmail.com
Reviewed by: tuexen, markj
Differential Revision: https://reviews.freebsd.org/D32404


# bf256782 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers

ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.
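
A sketch of the separate-output-parameter pattern, with hypothetical names and values. An errno and a mode are both plain ints, so a single return value cannot carry both without ambiguity; here the return value carries only the error and the mode goes through a pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode values and socket type. */
#define SKETCH_MODE_NONE	0
#define SKETCH_MODE_SW		1
#define SKETCH_MODE_IFNET	2

struct sketch_sock2 {
	bool listening;
	int mode;
};

/*
 * Return 0 and store the mode in *modep, or return an errno value and
 * leave *modep untouched.
 */
static int
sketch_get_mode(const struct sketch_sock2 *so, int *modep)
{
	assert(so != NULL && modep != NULL);
	if (so->listening)
		return (EINVAL);
	*modep = so->mode;
	return (0);
}
```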

Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977


# c782ea8b 14-Sep-2021 John Baldwin <jhb@FreeBSD.org>

Add a switch structure for send tags.

Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.
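
The switch-structure pattern can be sketched in plain C as follows. All names here are hypothetical and the operations are reduced to two; the point is only that each tag carries a pointer to a table of its own functions, so no code has to switch on the tag type, and two implementations of the same type can coexist.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_tag;

/* Hypothetical per-implementation table of operations plus the tag type. */
struct sketch_tag_sw {
	int (*query)(struct sketch_tag *);
	void (*free_tag)(struct sketch_tag *);
	int type;
};

/* Generic part of a tag: just a pointer to its switch structure. */
struct sketch_tag {
	const struct sketch_tag_sw *sw;
	int freed;
};

/* Takes the switch structure where a bare type was passed before. */
static void
sketch_tag_init(struct sketch_tag *t, const struct sketch_tag_sw *sw)
{
	assert(t != NULL && sw != NULL);
	t->sw = sw;
	t->freed = 0;
}

static int sketch_old_tls_query(struct sketch_tag *t) { (void)t; return (100); }
static int sketch_new_tls_query(struct sketch_tag *t) { (void)t; return (200); }
static void sketch_generic_free(struct sketch_tag *t) { t->freed = 1; }

/* Two implementations of the same tag type within one driver. */
static const struct sketch_tag_sw sketch_old_tls_sw = {
	sketch_old_tls_query, sketch_generic_free, 1
};
static const struct sketch_tag_sw sketch_new_tls_sw = {
	sketch_new_tls_query, sketch_generic_free, 1
};

/* Callers dispatch through the table instead of switching on the type. */
static int
sketch_tag_query(struct sketch_tag *t)
{
	return (t->sw->query(t));
}

static void
sketch_tag_free(struct sketch_tag *t)
{
	t->sw->free_tag(t);
}
```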

Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572


# f94acf52 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Rename sb(un)lock() and interlock with listen(2)

In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writers from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().

Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657


# 470e851c 30-Aug-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Support asynchronous dispatch of AEAD ciphers.

KTLS OCF support was originally targeted at software backends that
used host CPU cycles to encrypt TLS records. As a result, each KTLS
worker thread queued a single TLS record at a time and waited for it
to be encrypted before processing another TLS record. This works well
for software backends but limits throughput on OCF drivers for
coprocessors that support asynchronous operation such as qat(4) or
ccr(4). This change uses an alternate function (ktls_encrypt_async)
when encrypting TLS records via a coprocessor. This function queues TLS
records for encryption and returns. It defers the work done after a
TLS record has been encrypted (such as marking the mbufs ready) to a
callback invoked asynchronously by the coprocessor driver when a
record has been encrypted.

- Add a struct ktls_ocf_state that holds the per-request state stored
on the stack for synchronous requests. Asynchronous requests malloc
this structure while synchronous requests continue to allocate this
structure on the stack.

- Add a ktls_encrypt_async() variant of ktls_encrypt() which does not
perform request completion after dispatching a request to OCF.
Instead, the ktls_ocf backends invoke ktls_encrypt_cb() when a TLS
record request completes for an asynchronous request.

- Flag AEAD software TLS sessions as async if the backend driver
selected by OCF is an async driver.

- Pull code to create and dispatch an OCF request out of
ktls_encrypt() into a new ktls_encrypt_one() function used by both
ktls_encrypt() and ktls_encrypt_async().

- Pull code to "finish" the VM page shuffling for a file-backed TLS
record into a helper function ktls_finish_noanon() used by both
ktls_encrypt() and ktls_encrypt_cb().

Reviewed by: markj
Tested on: ccr(4) (jhb), qat(4) (markj)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31665


# d16cb228 16-Aug-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Fix accounting for TLS 1.0 empty fragments.

TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is
zero. However, these mbufs need to occupy a "unit" of space for the
purposes of M_NOTREADY tracking similar to regular mbufs. Previously
this was done for the page count returned from ktls_frame() and passed
to ktls_enqueue() as well as the page count passed to pru_ready().

However, sbready() and mb_free_notready() only use m_epg_nrdy to
determine the number of "units" of space in an M_EXT mbuf, so when a
TLS 1.0 fragment was marked ready it would mark one unit of the next
mbuf in the socket buffer as ready as well. To fix, set m_epg_nrdy to
1 for empty fragments. This actually simplifies the code as now only
ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest
of the KTLS functions can just use m_epg_nrdy.
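
The accounting problem can be sketched with a simplified model. The
struct xbuf type and the frame() and mark_ready() helpers below are
hypothetical; they only mimic the idea that readiness is tracked in
"units" and that the marking code looks at nothing but the not-ready
count.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an unmapped (M_EXTPG) mbuf. */
struct xbuf {
	int npgs;	/* payload pages; 0 for a TLS 1.0 empty fragment */
	int nrdy;	/* "units" that are still not ready */
};

/* Set up the not-ready count the way the fix describes. */
static void
frame(struct xbuf *b, int npgs)
{
	b->npgs = npgs;
	/* An empty fragment still occupies one unit. */
	b->nrdy = (npgs == 0) ? 1 : npgs;
}

/*
 * Mark 'count' units ready along a chain, consulting only nrdy.
 * Returns the number of units that could not be consumed.
 */
static int
mark_ready(struct xbuf *chain, size_t n, int count)
{
	for (size_t i = 0; i < n && count > 0; i++) {
		int take = chain[i].nrdy < count ? chain[i].nrdy : count;

		chain[i].nrdy -= take;
		count -= take;
	}
	return (count);
}
```

With nrdy left at zero for the empty fragment, marking one unit ready
in this model would consume a unit from the following buffer instead,
which is the failure described above.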

Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31536


# 95c51faf 11-Aug-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Init reset tag task for cloned sessions

When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.

Reviewed by: jhb
Sponsored by: Netflix


# 09066b98 05-Aug-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Use the new PNOLOCK flag

Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.

Reviewed by: jhb
Sponsored by: Netflix


# 2694c869 05-Aug-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: fix a panic with INVARIANTS

98215005b747fef67f44794ca64abd473b98bade introduced a new
thread that uses tsleep(..0) to sleep forever. This hit
an assert due to sleeping with a 0 timeout.

So spell "forever" using SBT_MAX instead, which does not
trigger the assert.

Pointy hat to: gallatin
Pointed out by: emaste
Sponsored by: Netflix


# 98215005 05-Aug-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: start a thread to keep the 16k ktls buffer zone populated

Ktls recently received an optimization where we allocate 16k
physically contiguous crypto destination buffers. This provides a
large (more than 5%) reduction in CPU use in our
workload. However, after several days of uptime, the performance
benefit disappears because we have frequent allocation failures
from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is
trimmed, and some 16k buffers are freed back to the OS. When load
picks back up again, re-allocating those 16k buffers fails after
some number of days of uptime because physical memory has become
fragmented. This causes allocations to fail, because they are
intentionally done with M_NORECLAIM, so as to avoid pausing
the ktls crypto work thread while the VM system defragments
memory.

To work around this, this change starts one thread per VM domain
to allocate ktls buffers without M_NORECLAIM, as we don't care if
this thread is paused while memory is defragged. The thread then
frees the buffers back into the ktls buffer zone, thus allowing
future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither
of the races really matter. In the worst case, we could have
either spurious wakeups or we could have to wait 1 second until
the next rate-limited allocation failure to wake up the thread.

This patch has been in use at Netflix on a handful of servers,
and seems to fix the issue.

Differential Revision: https://reviews.freebsd.org/D31260
Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version)
Sponsored by: Netflix


# 4150a5a8 07-Jul-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: fix NOINET build

Reported by: mjguzik
Sponsored by: Netflix


# 28d0a740 06-Jul-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: auto-disable ifnet (inline hw) kTLS

Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct).
TCP connections whose retransmit rate exceeds this percentage are
switched to SW kTLS so as to conserve PCIe bandwidth.
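
A threshold check of this kind could look roughly like the sketch
below. The struct conn_stats counters and should_fall_back_to_sw()
are hypothetical; the real code may track and compare retransmits
differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection counters. */
struct conn_stats {
	uint64_t bytes_sent;	/* total payload bytes transmitted */
	uint64_t bytes_rexmit;	/* bytes that were retransmissions */
};

/*
 * Returns true when retransmitted bytes exceed max_pct percent of
 * all bytes sent.  Cross-multiplying avoids floating point and
 * division; the counters are assumed small enough not to overflow
 * when multiplied by 100.
 */
static bool
should_fall_back_to_sw(const struct conn_stats *cs, unsigned max_pct)
{
	if (cs->bytes_sent == 0)
		return (false);
	return (cs->bytes_rexmit * 100 > cs->bytes_sent * max_pct);
}
```

For example, with a 5 percent limit a connection that retransmitted
10000 of 1000000 bytes stays on the NIC, while one that retransmitted
100000 bytes would be moved to software in this model.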

Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30908


# 904a08f3 30-Jun-2021 Mateusz Guzik <mjg@FreeBSD.org>

ktls: switch bare zone_mbuf use to m_free_raw

Reviewed by: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30955


# faf0224f 15-Jun-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Don't mark existing received mbufs notready for TOE TLS.

The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.

To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.

Sponsored by: Chelsio Communications


# ed5e13cf 14-Jun-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Fix interaction with RATELIMIT

uipc_ktls.c was missing opt_ratelimit.h, so it was
never noticing that RATELIMIT was enabled. Once it was
enabled, it failed to compile as ktls_modify_txrtlmt()
had accrued a compilation error when it was not being
compiled in.

Sponsored by: Netflix


# 6b313a3a 25-May-2021 John Baldwin <jhb@FreeBSD.org>

Include the trailer in the original dst_iov.

This avoids creating a duplicate copy on the stack just to
append the trailer.

Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30139


# 21e3c1fb 25-May-2021 John Baldwin <jhb@FreeBSD.org>

Assume OCF is the only KTLS software backend.

This removes support for loadable software backends. The KTLS OCF
support is now always included in kernels with KERN_TLS and the
ktls_ocf.ko module has been removed. The software encryption routines
now take an mbuf directly and use the TLS mbuf as the crypto buffer
when possible.

Bump __FreeBSD_version for software backends in ports.

Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30138


# 89b65087 05-Mar-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Hide initialization message behind bootverbose

We don't typically print anything when a subsystem initializes itself,
and KTLS is currently disabled by default anyway.

Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29097


# 49f6925c 03-Mar-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Cache output buffers for software encryption

Maintain a cache of physically contiguous runs of pages for use as
output buffers when software encryption is configured and in-place
encryption is not possible. This makes allocation and free cheaper
since in the common case we avoid touching the vm_page structures for
the buffer, and fewer calls into UMA are needed. gallatin@ reports a
~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after
this change.

It is possible that we will not be able to allocate these buffers if
physical memory is fragmented. To avoid frequently calling into the
physical memory allocator in this scenario, rate-limit allocation
attempts after a failure. In the failure case we fall back to the old
behaviour of allocating a page at a time.
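
The back-off can be sketched as follows. Everything here is a
hypothetical userland model: struct buf_cache, contig_alloc() and
cache_alloc() are illustrative names, and the "avail" flag merely
simulates whether contiguous memory can be found.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-worker allocator state. */
struct buf_cache {
	uint64_t last_fail;	/* tick of the most recent failure */
	bool	 failed_once;	/* no back-off before the first failure */
	uint64_t backoff;	/* ticks to wait after a failure */
};

/* Stand-in for the contiguous allocator. */
static void *
contig_alloc(bool avail)
{
	static char buf[16384];

	return (avail ? buf : NULL);
}

/*
 * Try the contiguous buffer cache.  A NULL return tells the caller
 * to fall back to allocating a page at a time.  '*attempted' reports
 * whether the expensive allocator was actually invoked.
 */
static void *
cache_alloc(struct buf_cache *bc, uint64_t now, bool avail, bool *attempted)
{
	void *p;

	*attempted = false;
	if (bc->failed_once && now - bc->last_fail < bc->backoff)
		return (NULL);		/* still backing off */
	*attempted = true;
	p = contig_alloc(avail);
	if (p == NULL) {
		bc->failed_once = true;
		bc->last_fail = now;
	}
	return (p);
}
```

During the back-off window the sketch returns NULL without touching
the allocator at all, which is the point of the rate limit.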

N.B.: this scheme could be simplified, either by simply using malloc()
and looking up the PAs of the pages backing the buffer, or by falling
back to page by page allocation and creating a mapping in the cache
zone. This requires some way to save a mapping of an M_EXTPG page array
in the mbuf, though. m_data is not really appropriate. The second
approach may be possible by saving the mapping in the plinks union of
the first vm_page structure of the array, but this would force a vm_page
access when freeing an mbuf.

Reviewed by: gallatin, jhb
Tested by: gallatin
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D28556


# 90972f04 19-Feb-2021 John Baldwin <jhb@FreeBSD.org>

ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter.

I missed updating this counter when rebasing the changes in
9c64fc40290e08f6dc6b75aa04084b04e48a61af after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b9891bb1bfa7a58383ef5126821f7e46e3.

Fixes: 9c64fc40290e Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by: Netflix


# 9c64fc40 18-Feb-2021 John Baldwin <jhb@FreeBSD.org>

Add Chacha20-Poly1305 as a KTLS cipher suite.

Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.
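
As a sketch of that construction: the 64-bit record sequence number
is encoded big-endian, left-padded with zeros to the 12-byte IV
length, and XORed with the implicit IV. The function and macro names
below are illustrative only and are not taken from the kernel
sources.

```c
#include <stdint.h>
#include <string.h>

#define TLS_AEAD_NONCE_LEN	12	/* illustrative name */

/*
 * Derive the per-record nonce from the implicit IV and the record
 * sequence number.
 */
static void
tls_record_nonce(const uint8_t iv[TLS_AEAD_NONCE_LEN], uint64_t seqno,
    uint8_t nonce[TLS_AEAD_NONCE_LEN])
{
	memcpy(nonce, iv, TLS_AEAD_NONCE_LEN);
	/*
	 * XOR the big-endian sequence number into the last 8 bytes.
	 * The first 4 bytes correspond to the zero padding and are
	 * left unchanged.
	 */
	for (int i = 0; i < 8; i++)
		nonce[TLS_AEAD_NONCE_LEN - 1 - i] ^= (uint8_t)(seqno >> (8 * i));
}
```

With sequence number 0 the nonce equals the IV, and each later
record changes only the low-order bytes.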

Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27839


# b5aa9ad4 08-Feb-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Make configuration sysctls available as tunables

Reviewed by: gallatin, jhb
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28499


# 1755b2b9 08-Feb-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Use COUNTER_U64_DEFINE_EARLY

This makes it a bit more straightforward to add new counters when
debugging. No functional change intended.

Reviewed by: jhb
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28498


# 3f43ada9 28-Jan-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.

Originally IFCAP_NOMAP meant that the mbuf has an external storage
pointer that points to an unmapped address. Then, this was extended to
an array of such pointers. Then, such mbufs were augmented with a
header/trailer. Basically, extended mbufs are extended, and the set of
features is subject to change. The new name should be generic enough to
avoid further renaming.


# 4dc1b17d 19-Jan-2021 Mark Johnston <markj@FreeBSD.org>

ktls: Improve handling of the bind_threads tunable a bit

- Only check for empty domains if we actually tried to configure domain
affinity in the first place. Otherwise setting bind_threads=1 will
always cause the sysctl value to be reported as zero. This is
harmless since the threads end up being bound, but it's confusing.
- Try to improve the sysctl description a bit.

Reviewed by: gallatin, jhb
Submitted by: Klara, Inc.
Sponsored by: Ampere Computing
Differential Revision: https://reviews.freebsd.org/D28161


# 6685e259 08-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: don't use KTLS socket option on listening sockets

KTLS socket options make use of socket buffers, which are not
available for listening sockets.

Reported by: syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com
Reviewed by: jhb@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D27948


# 02bc3865 19-Dec-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Optionally bind ktls threads to NUMA domains

When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.

This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21648


# 36e0a362 29-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().

This gives a more uniform API for send tag life cycle management.

Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000


# 521eac97 28-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Support hardware rate limiting (pacing) with TLS offload.

- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691


# 6bcf3c46 19-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Check TF_TOE not the tod pointer to determine if TOE is active.

The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket. There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.

Reported by: Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26803


# c2a8fd6f 13-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Permit sending empty fragments for TLS 1.0.

Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically
send empty TLS records ("empty fragments"). These TLS records have no
payload (and thus a page count of zero). m_uiotombuf_nomap() was
returning NULL instead of an empty mbuf, and a few places needed to be
updated to treat an empty TLS record as having a page count of "1" as
0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark
ready via sbready()).

Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26729


# d29a3de2 04-Sep-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

uipc_ktls: remove unused static function

m_segments() was added with r363464 but never used. Remove it to
avoid warnings when compiling kernels.

Reported by: rmacklem (also says jhb)
Reviewed by: gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D26330


# 9675d889 04-Sep-2020 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Check for a NULL send tag in ktls_cleanup()

When using ifnet ktls, and when ktls_reset_send_tag()
fails to allocate a replacement tag, it leaves
the tls session's snd_tag pointer NULL. ktls_cleanup()
tries to release the send tag, and will trip over
this NULL pointer and panic unless NULL is checked for.

Reviewed by: jhb
Sponsored by: Netflix


# 3c0e5685 23-Jul-2020 John Baldwin <jhb@FreeBSD.org>

Add support for KTLS RX via software decryption.

Allow TLS records to be decrypted in the kernel after being received
by a NIC. At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse. Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the
tail of a socket buffer and the KTLS layer decrypts those records
before returning them to userland applications. However, there is an
important difference:

- In the transmit case, the socket buffer is always a single "record"
holding a chain of mbufs. Not-yet-encrypted mbufs are marked not
ready (M_NOTREADY) and released to protocols for transmit by marking
mbufs ready once their data is encrypted.

- In the receive case, incoming (encrypted) data appended to the
socket buffer is still a single stream of data from the protocol,
but decrypted TLS records are stored as separate records in the
socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seem to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

- Socket buffers now contain an additional chain of mbufs (sb_mtls,
sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
the protocol layer. These mbufs are still marked M_NOTREADY, but
soreceive*() generally don't know about them (except that they will
block waiting for data to be decrypted for a blocking read).

- Each time a new mbuf is appended to this TLS mbuf chain, the socket
buffer peeks at the TLS record header at the head of the chain to
determine the encrypted record's length. If enough data is queued
for the TLS record, the socket is placed on a per-CPU TLS workqueue
(reusing the existing KTLS workqueues and worker threads).

- The worker thread loops over the TLS mbuf chain decrypting records
until it runs out of data. Each record is detached from the TLS
mbuf chain while it is being decrypted to keep the mbufs "pinned".
However, a new sb_dtlscc field tracks the character count of the
detached record and sbcut()/sbdrop() is updated to account for the
detached record. After the record is decrypted, the worker thread
first checks to see if sbcut() dropped the record. If so, it is
freed (can happen when a socket is closed with pending data).
Otherwise, the header and trailer are stripped from the original
mbufs, a control message is created holding the decrypted TLS
header, and the decrypted TLS record is appended to the "normal"
socket buffer chain.

(Side note: the SBCHECK() infrastructure was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)
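
The "peek at the record header" step can be sketched in userland C.
A TLS record header is 5 bytes: a content type, a 2-byte version and
a 2-byte big-endian length of the data that follows. The helper name
and its flat-buffer interface are hypothetical; the kernel works on
an mbuf chain rather than a contiguous array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN	5	/* type(1) + version(2) + length(2) */

/*
 * Peek at the header at the front of the queued ciphertext.  Returns
 * true when a complete record is queued.  Whenever a full header is
 * available, *rec_len is set to the on-the-wire record size, header
 * included, even if the record is still incomplete.
 */
static bool
tls_record_complete(const uint8_t *queued, size_t queued_len,
    size_t *rec_len)
{
	size_t body;

	if (queued_len < TLS_HDR_LEN)
		return (false);
	body = ((size_t)queued[3] << 8) | queued[4];	/* big-endian */
	*rec_len = TLS_HDR_LEN + body;
	return (queued_len >= *rec_len);
}
```

A caller modelled on the description above would schedule decryption
only once this returns true, and otherwise wait for more data to be
appended.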

Tested by: rmacklem (various versions)
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24628


# 4a711b8d 25-Jun-2020 John Baldwin <jhb@FreeBSD.org>

Use zfree() instead of explicit_bzero() and free().

In addition to reducing lines of code, this also ensures that the full
allocation is always zeroed avoiding possible bugs with incorrect
lengths passed to explicit_bzero().

Suggested by: cem
Reviewed by: cem, delphij
Approved by: csprng (cem)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25435


# 4f3c0f3d 26-May-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix build issue after r360292 when using both RSS and KERN_TLS options.

Sponsored by: Mellanox Technologies


# 6edfd179 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.1: mechanically rename M_NOMAP to M_EXTPG

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# bccf6e26 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# c4ee38f8 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.3: Rename mbuf_ext_pg_len() to m_epg_pagelen() that
uses mbuf argument.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# d90fe9d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.1: Build TLS workqueue from mbufs, not struct mbuf_ext_pgs.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# eeec8348 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Get rid of the mbuf self-pointing pointer.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 7433a5a9 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Start moving into EPG_/epg_ namespace. There is only one flag, but
next commit brings in second flag, so let them already be in the
future namespace.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 0c103266 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Continuation of multi page mbuf redesign from r359919.

The following series of patches addresses three things:

Now that array of pages is embedded into mbuf, we no longer need
separate structure to pass around, so struct mbuf_ext_pgs is an
artifact of the first implementation. And struct mbuf_ext_pgs_data
is a crutch to accommodate the main idea of r359919 with minimal churn.

Also, M_EXT mbufs of type EXT_PGS are just a synonym for M_NOMAP.

The namespace for the new feature is somewhat inconsistent and
sometimes has lengthy prefixes. In these patches we will
gradually bring the namespace to "m_epg" prefix for all mbuf
fields and most functions.

Step 1 of 4:

o Anonymize mbuf_ext_pgs_data, embed in m_ext
o Embed mbuf_ext_pgs
o Start documenting all this entanglement

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# f1f93475 27-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Initial support for kernel offload of TLS receive.

- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
authentication algorithms and keys as well as the initial sequence
number.

- When reading from a socket using KTLS receive, applications must use
recvmsg(). Each successful call to recvmsg() will return a single
TLS record. A new TCP control message, TLS_GET_RECORD, will contain
the TLS record header of the decrypted record. The regular message
buffer passed to recvmsg() will receive the decrypted payload. This
is similar to the interface used by Linux's KTLS RX except that
Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
1.2, though I believe we will be able to reuse the same interface
and structures for 1.3.


# ec1db6e1 27-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Add the initial sequence number to the TLS enable socket option.

This will be needed for KTLS RX.

Reviewed by: gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24451


# 454d3896 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix LINT build #2 after r360292.

Pointyhat to: melifaro


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is built on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with other parts of the stack and allows
faster lookup algorithms to be introduced transparently.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nexthop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remain the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
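
The five-page figure follows from simple arithmetic, sketched below
under the assumption of 4096-byte pages. PAGE_SZ and pages_spanned()
are illustrative names, not kernel definitions.

```c
#include <stddef.h>

#define PAGE_SZ	4096u	/* assumed page size */

/*
 * Number of pages touched by 'len' bytes that start 'off' bytes
 * into a page.
 */
static size_t
pages_spanned(size_t off, size_t len)
{
	if (len == 0)
		return (0);
	return ((off % PAGE_SZ + len + PAGE_SZ - 1) / PAGE_SZ);
}
```

A 16384-byte record that starts on a page boundary touches four
pages, but any non-zero starting offset pushes the tail onto a fifth
page.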

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone.

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
accessing ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# c0341432 27-Mar-2020 John Baldwin <jhb@FreeBSD.org>

Refactor driver and consumer interfaces for OCF (in-kernel crypto).

- The linked list of cryptoini structures used in session
initialization is replaced with a new flat structure: struct
crypto_session_params. This session includes a new mode to define
how the other fields should be interpreted. Available modes
include:

- COMPRESS (for compression/decompression)
- CIPHER (for simple encryption/decryption)
- DIGEST (computing and verifying digests)
- AEAD (combined auth and encryption such as AES-GCM and AES-CCM)
- ETA (combined auth and encryption using encrypt-then-authenticate)

Additional modes could be added in the future (e.g. if we wanted to
support TLS MtE for AES-CBC in the kernel we could add a new mode
for that. TLS modes might also affect how AAD is interpreted, etc.)

The flat structure also includes the key lengths and algorithms as
before. However, code doesn't have to walk the linked list and
switch on the algorithm to determine which key is the auth key vs
encryption key. The 'csp_auth_*' fields are always used for auth
keys and settings and 'csp_cipher_*' for cipher. (Compression
algorithms are stored in csp_cipher_alg.)

- Drivers no longer register a list of supported algorithms. This
doesn't quite work when you factor in modes (e.g. a driver might
support both AES-CBC and SHA2-256-HMAC separately but not combined
for ETA). Instead, a new 'crypto_probesession' method has been
added to the kobj interface for symmetric crypto drivers. This
method returns a negative value on success (similar to how
device_probe works) and the crypto framework uses this value to pick
the "best" driver. There are three constants for hardware
(e.g. ccr), accelerated software (e.g. aesni), and plain software
(cryptosoft) that give preference in that order. One effect of this
is that if you request only hardware when creating a new session,
you will no longer get a session using accelerated software.
Another effect is that the default setting to disallow software
crypto via /dev/crypto now disables accelerated software.

Once a driver is chosen, 'crypto_newsession' is invoked as before.

- Crypto operations are now solely described by the flat 'cryptop'
structure. The linked list of descriptors has been removed.

A separate enum has been added to describe the type of data buffer
in use instead of using CRYPTO_F_* flags to make it easier to add
more types in the future if needed (e.g. wired userspace buffers for
zero-copy). It will also make it easier to re-introduce separate
input and output buffers (in-kernel TLS would benefit from this).

Try to make the flags related to IV handling less insane:

- CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv'
member of the operation structure. If this flag is not set, the
IV is stored in the data buffer at the 'crp_iv_start' offset.

- CRYPTO_F_IV_GENERATE means that a random IV should be generated
and stored into the data buffer. This cannot be used with
CRYPTO_F_IV_SEPARATE.

If a consumer wants to deal with explicit vs implicit IVs, etc. it
can always generate the IV however it needs and store partial IVs in
the buffer and the full IV/nonce in crp_iv and set
CRYPTO_F_IV_SEPARATE.

The layout of the buffer is now described via fields in cryptop.
crp_aad_start and crp_aad_length define the boundaries of any AAD.
Previously with GCM and CCM you defined an auth crd with this range,
but for ETA your auth crd had to span both the AAD and plaintext
(and they had to be adjacent).

crp_payload_start and crp_payload_length define the boundaries of
the plaintext/ciphertext. Modes that only do a single operation
(COMPRESS, CIPHER, DIGEST) should only use this region and leave the
AAD region empty.

If a digest is present (or should be generated), its starting
location is marked by crp_digest_start.
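
The region fields lend themselves to a simple bounds check. The sketch
below is an assumption about how such a check could look, not the
framework's actual validation code; 'toy_layout' and the helper names
are hypothetical, and only the crp_* field names are taken from the
text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy subset of the layout fields named in the text (illustration only). */
struct toy_layout {
	size_t crp_aad_start, crp_aad_length;
	size_t crp_payload_start, crp_payload_length;
	size_t crp_digest_start;
};

/* True if [start, start + len) lies within a buffer of buflen bytes. */
static bool
toy_region_ok(size_t start, size_t len, size_t buflen)
{
	/* Written to avoid overflow in start + len. */
	return (start <= buflen && len <= buflen - start);
}

/* Check every region against the buffer; digest_len is 0 if no digest. */
static bool
toy_layout_ok(const struct toy_layout *l, size_t buflen, size_t digest_len)
{
	return (toy_region_ok(l->crp_aad_start, l->crp_aad_length, buflen) &&
	    toy_region_ok(l->crp_payload_start, l->crp_payload_length,
	    buflen) &&
	    toy_region_ok(l->crp_digest_start, digest_len, buflen));
}
```

For example, a request with 13 bytes of AAD at offset 0, 100 bytes of
payload at offset 13, and a 16-byte digest at offset 113 fits in a
129-byte buffer but not in a 128-byte one.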

Instead of using the CRD_F_ENCRYPT flag to determine the direction
of the operation, cryptop now includes an 'op' field defining the
operation to perform. For digests I've added a new VERIFY digest
mode which assumes a digest is present in the input and fails the
request with EBADMSG if it doesn't match the internally-computed
digest. GCM and CCM already assumed this, and the new AEAD mode
requires this for decryption. The new ETA mode now also requires
this for decryption, so IPsec and GELI no longer do their own
authentication verification. Simple DIGEST operations can also do
this, though there are no in-tree consumers.
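
The VERIFY behaviour can be sketched in a few lines. This is a minimal
model under stated assumptions: 'toy_digest' is a non-cryptographic
placeholder (an FNV-1a hash) standing in for a real digest, and all
function names are hypothetical. Only the EBADMSG result on mismatch is
taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DIGEST_LEN 4

/* Stand-in "digest": NOT cryptographic, just a deterministic placeholder. */
static void
toy_digest(const uint8_t *p, size_t len, uint8_t out[TOY_DIGEST_LEN])
{
	uint32_t h = 2166136261u; /* FNV-1a offset basis */

	for (size_t i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	for (int i = 0; i < TOY_DIGEST_LEN; i++)
		out[i] = (uint8_t)(h >> (8 * i));
}

/* Compare without an early exit so timing does not reveal the position. */
static int
toy_ct_differs(const uint8_t *a, const uint8_t *b, size_t len)
{
	uint8_t acc = 0;

	for (size_t i = 0; i < len; i++)
		acc |= a[i] ^ b[i];
	return (acc != 0);
}

/* VERIFY-style check: 0 on match, EBADMSG on mismatch. */
static int
toy_verify_digest(const uint8_t *payload, size_t len, const uint8_t *expected)
{
	uint8_t computed[TOY_DIGEST_LEN];

	toy_digest(payload, len, computed);
	return (toy_ct_differs(computed, expected, TOY_DIGEST_LEN) ?
	    EBADMSG : 0);
}
```

Doing the comparison inside the framework means a consumer only needs to
inspect the request's error code rather than recompute and compare the
digest itself.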

To eventually support some refcounting to close races, the session
cookie is now passed to crypto_getop() and clients should no longer
set crp_session directly.

- Asymmetric crypto operation structures should be allocated via
crypto_getkreq() and freed via crypto_freekreq(). This permits the
crypto layer to track open asym requests and close races with a
driver trying to unregister while asym requests are in flight.

- crypto_copyback, crypto_copydata, crypto_apply, and
crypto_contiguous_subsegment now accept the 'crp' object as the
first parameter instead of individual members. This makes it easier
to deal with different buffer types in the future as well as
separate input and output buffers. It's also simpler for driver
writers to use.

- bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer.
This understands the various types of buffers so that drivers that
use DMA do not have to be aware of different buffer types.

- Helper routines now exist to build an auth context for HMAC IPAD
and OPAD. This reduces some duplicated work among drivers.

- Key buffers are now treated as const throughout the framework and in
device drivers. However, session key buffers provided when a session
is created are expected to remain alive for the duration of the
session.

- GCM and CCM sessions now only specify a cipher algorithm and a cipher
key. The redundant auth information is not needed or used.

- For cryptosoft, split up the code a bit such that the 'process'
callback now invokes a function pointer in the session. This
function pointer is set based on the mode (in effect) though it
simplifies a few edge cases that would otherwise be in the switch in
'process'.

It does split up GCM vs CCM which I think is more readable even if there
is some duplication.

- I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC
as an auth algorithm and updated cryptocheck to work with it.

- Combined cipher and auth sessions via /dev/crypto now always use ETA
mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored.
This was actually documented as being true in crypto(4) before, but
the code had not implemented this before I added the CIPHER_FIRST
flag.

- I have not yet updated /dev/crypto to be aware of explicit modes for
sessions. I will probably do that at some point in the future as well
as teach it about IV/nonce and tag lengths for AEAD so we can support
all of the NIST KAT tests for GCM and CCM.

- I've split up the existing crypto.9 manpage into several pages
of which many are written from scratch.

- I have converted all drivers and consumers in the tree and verified
that they compile, but I have not tested all of them. I have tested
the following drivers:

- cryptosoft
- aesni (AES only)
- blake2
- ccr

and the following consumers:

- cryptodev
- IPsec
- ktls_ocf
- GELI (lightly)

I have not tested the following:

- ccp
- aesni with sha
- hifn
- kgssapi_krb5
- ubsec
- padlock
- safe
- armv8_crypto (aarch64)
- glxsb (i386)
- sec (ppc)
- cesa (armv7)
- cryptocteon (mips64)
- nlmsec (mips64)

Discussed with: cem
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D23677


# 98085bae 09-Mar-2020 Andrew Gallatin <gallatin@FreeBSD.org>

make lacp's use_numa hashing aware of send tags

When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refactors lacp_select_tx_port_by_hash() and
lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
always called by lacp_select_tx_port()

- pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

- adds a numa_domain field to if_snd_tag_alloc_params

- plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by: rrs, hselasky, jhb
Sponsored by: Netflix


# a2fba2a7 03-Mar-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

uipc_ktls: make RSS compile again here

The results of ktls_get_cpu() are stored in u_int and NETISR_CPUID_NONE
requires u_int. Adjust uint16_t to u_int in order to make RSS kernels
compile some more again.

HPTS still has to be fixed, which is a bit more complicated.

Reviewed by: jhb, gallatin, rrs
Differential Revision: https://reviews.freebsd.org/D23726


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# f85e1a80 25-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ktls_frame() never fail. Caller must supply correct mbufs.
This makes sendfile code a bit simpler.


# 1f69a509 21-Jan-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Make sure the VNET is properly set when calling tcp_drop() from
the ktls taskqueue callback function.

A valid VNET is needed when updating statistics.

panic()
tcp_state_change()
tcp_drop()
ktls_reset_send_tag()
taskqueue_run_locked()
taskqueue_thread_loop()

Sponsored by: Mellanox Technologies


# 90746943 14-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Since this code uses if_ref()/if_rele() it must include if_var.h
explicitly, not via header pollution.


# 815db2f6 28-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

ktls_session zone: don't need to specify uma trash

The use of the uma trash procedures is automatic, there's no need to
pass them explicitly here.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22582


# 1a496125 06-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove a few outdated comments and extraneous assertions. No
functional change here.


# 7d29eb9a 24-Oct-2019 John Baldwin <jhb@FreeBSD.org>

Use a counter with a random base for explicit IVs in GCM.

This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq(). This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.
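
A minimal sketch of the counter idea follows. It assumes the explicit IV
is an 8-byte field written in big-endian order and that the caller
supplies the random base (in a kernel this would come from a random
source; here it is just a parameter). The 'toy_gcm_iv' type and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-session explicit IV state (illustration only). */
struct toy_gcm_iv {
	uint64_t next; /* value to use for the next record */
};

/* Seed the counter with a caller-provided random base. */
static void
toy_gcm_iv_init(struct toy_gcm_iv *s, uint64_t random_base)
{
	s->next = random_base;
}

/*
 * Emit the explicit IV for one record and advance the counter.
 * Big-endian encoding is an assumption of this sketch.
 */
static void
toy_gcm_iv_next(struct toy_gcm_iv *s, uint8_t out[8])
{
	uint64_t v = s->next++;

	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

Because the value does not depend on the record sequence number, the
whole header can be filled in at framing time. Unsigned wraparound keeps
consecutive values distinct until the 64-bit space is exhausted.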

Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22117


# 9e14430d 08-Oct-2019 John Baldwin <jhb@FreeBSD.org>

Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.

This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TCP_TXTLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TCP_TXTLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891


# b2dba663 27-Sep-2019 Andrew Gallatin <gallatin@FreeBSD.org>

kTLS: Fix a bug where we would not encrypt anon data in place.

Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, e.g., not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
were part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21796


# 6554362c 27-Sep-2019 Andrew Gallatin <gallatin@FreeBSD.org>

kTLS support for TLS 1.3

TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.
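
The framing described above can be sketched as follows. The constants
reflect the TLS 1.3 wire format (outer type 23 for application data,
legacy version 0x0303); the function names, buffer handling, and the
caller-supplied tag length are assumptions of this sketch rather than
the kernel's code, and no encryption is performed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TLS_TYPE_APP   23     /* application_data outer record type */
#define TOY_TLS_LEGACY_VER 0x0303 /* TLS 1.2 version shown on the wire */

/*
 * Build the TLS 1.3 inner plaintext: payload followed by the real type.
 * Returns the inner length, or 0 if 'out' is too small.
 */
static size_t
toy_tls13_inner(const uint8_t *data, size_t len, uint8_t real_type,
    uint8_t *out, size_t outcap)
{
	if (len >= outcap)
		return (0);
	memcpy(out, data, len);
	out[len] = real_type;
	return (len + 1);
}

/* Fill the 5-byte outer record header for an encrypted TLS 1.3 record. */
static void
toy_tls13_header(size_t inner_len, size_t tag_len, uint8_t hdr[5])
{
	size_t wire_len = inner_len + tag_len;

	assert(wire_len <= 0xffff);
	hdr[0] = TOY_TLS_TYPE_APP;
	hdr[1] = TOY_TLS_LEGACY_VER >> 8;
	hdr[2] = TOY_TLS_LEGACY_VER & 0xff;
	hdr[3] = (uint8_t)(wire_len >> 8);
	hdr[4] = (uint8_t)(wire_len & 0xff);
}
```

This is why the real record type has to travel with the mbuf: the outer
header no longer carries it, so the encryption backend needs it to
append the trailing type byte before encrypting.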

Reviewed by: jhb, hselasky
Sponsored by: Netflix
Differential Revision: D21801


# 61b8a4af 20-Sep-2019 Andrew Gallatin <gallatin@FreeBSD.org>

remove redundant "ktls" in KTLS thr name

This reduces the string width of the ktls thread name
and improves "ps" output.

Glanced at by: jhb
Event: EuroBSDCon hackathon
Sponsored by: Netflix


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of
pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is
completed. For other writes (vn_sendfile when pages are available,
write(2), etc.), the PRUS_NOTREADY flag is set when invoking
pru_send() along with invoking ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
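
The send tag check and refresh path described above can be modelled
roughly as below. Everything here is a hypothetical simplification: the
'toy_*' types and functions do not exist in the kernel, the interface is
reduced to an opaque pointer, and the deferred task is represented by a
boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ifnet {
	int index;
};

struct toy_session {
	const struct toy_ifnet *tag_ifp; /* interface the send tag is bound to */
	bool refresh_pending;            /* stand-in for the scheduled task */
	bool dropped;                    /* connection was dropped */
};

enum toy_verdict {
	TOY_SEND_TAGGED, /* tag the packet and hand it to the NIC */
	TOY_DROP         /* drop the packet and refresh the tag later */
};

/* Output-path check: does the tag match the outbound interface? */
static enum toy_verdict
toy_output_check(struct toy_session *s, const struct toy_ifnet *out_ifp)
{
	if (s->tag_ifp == out_ifp)
		return (TOY_SEND_TAGGED);
	s->refresh_pending = true;
	return (TOY_DROP);
}

/*
 * Deferred refresh: 'new_ifp' is the interface a new tag was allocated
 * on, or NULL if allocation failed, in which case the connection drops.
 */
static void
toy_refresh_tag(struct toy_session *s, const struct toy_ifnet *new_ifp)
{
	s->refresh_pending = false;
	if (new_ifp == NULL)
		s->dropped = true;
	else
		s->tag_ifp = new_ifp;
}
```

In the lagg failover case this corresponds to one dropped packet on the
old tag, a refresh onto the new port, and tagged transmission resuming
afterwards.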

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp
with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277