History log of /freebsd-current/sys/net/iflib.c
Revision Date Author Comments
# ed34a6b6 17-Jan-2023 Eric Joyner <erj@FreeBSD.org>

iflib: Add subinterface interrupt allocation function

The ice(4) driver will add the ability to create extra interfaces
that hang off of the base interface; to do that the driver requires
a method for the subinterface to request hardware interrupt resources
from the base interface.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D39930


# 3c7da27a 22-Mar-2023 Eric Joyner <erj@FreeBSD.org>

iflib: Add sysctl to request extra MSIX vectors on driver load

Intended to be used with upcoming feature to add sub-interfaces, since
those new interfaces will be dynamically created and will need to have
spare MSI-X interrupts already allocated for them on driver load.

This sysctl is marked as a tunable since it will need to be set before
the driver is loaded since MSI-X interrupt allocation and setup is
done during the attach process.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D41326


# e4a0c92e 16-Apr-2024 Stephen J. Kiernan <stevek@FreeBSD.org>

iflib: Correct indentation according to style(9)

The indentation style for the SYSCTL_* macros used was not matching KNF.

Reported by: jhb
Differential Revision: https://reviews.freebsd.org/D44811


# 303dea74 03-Apr-2024 Stephen J. Kiernan <stevek@FreeBSD.org>

iflib: Fix compiler warnings

Some of the QUAD sysctls are actually for unsigned quad values.
Switch to using UQUAD instead, as that is meant for unsigned.

Reviewed by: erj, jhb
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44620


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# d2dd3d5a 04-Aug-2023 Eric Joyner <erj@FreeBSD.org>

iflib: Remove redundant variable

In iflib_init_locked(), sctx and scctx both point to the same value,
which is the ifc_softc_ctx field in the iflib softc. Remove the
declaration and assignment to sctx since scctx can be used instead, and
the name of scctx follows the naming convention used for local variables
that point to ifc_softc_ctx.

In theory there should be no functional impact with this change.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

Reviewed by: kbowling@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D41325


# 7f527d48 04-Aug-2023 Eric Joyner <erj@FreeBSD.org>

iflib: Fix white space and reduce some line lengths

This helps align some of the code with the rest of the style used in
iflib, but as marius@ points out, this is not style(9).

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

Reviewed by: kbowling@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D41324


# 7ff9ae90 03-Aug-2023 Marius Strobl <marius@FreeBSD.org>

iflib(9): Remove support for cloning pseudo interfaces

This code was used by the first incarnation of wg(4) and is dead ever
since f187d6dfbf633665ba6740fe22742aec60ce02a2 has removed the latter
again. Moreover, this code matched iflib(4) like a square peg fits in
a round hole, was incomplete and despite some hacks still tailored to
VPC and wg(4) but not generic. In effect, this reverts the following:
09f6ff4f1a47c3009dc16fdc609a44f2341bc7ac (w/ its "ancillary changes")
9aeca21324f481f57f2ecb7009f461f4f51b62b3
1f93e931d9f0c688f43f98ef777e04636a325526
0f9544d03e89d180f94a7a84b110ec7d2b6c625a
0dd691b41276ce13d25ffb1443af27f85038aa3f

Reviewed by: erj, kbowling
Differential Revision: <https://reviews.freebsd.org/D41196>


# 04d4e345 27-Jul-2023 Przemyslaw Lewandowski <przemyslawx.lewandowski@intel.com>

iflib: Fix panic during driver reload stress test

During a driver reload stress test, after 50-300 reloads a panic occurs.
After adding sleeps in between loading and unloading the driver, the
issue does not occur. It's possible that loading/unloading too fast may
cause the gt_taskqueue pointer to be freed earlier than expected;
checking for a null pointer first fixes it.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

Reviewed by: erj@
Tested by: jeffrey.e.pieper@intel.com
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D39457


# a52f23f4 19-Jul-2023 Eric Joyner <erj@FreeBSD.org>

iflib: Unlock ctx lock around call to ether_ifattach()

Panic occurs during loading driver using kldload. It exists since netlink is
enabled. There is problem with double locking ctx. This fix allows to call
ether_ifattach() without locked ctx.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

PR: 271768
Reviewed by: erj@, jhb@
MFC after: 1 day
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D40557


# a6b55ee6 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH

Expect that drivers call into the network stack with the net epoch
entered. This has already been the fact since early 2020. The net
interrupts, that are marked with INTR_TYPE_NET, were entering epoch
since 511d1afb6bf. For the taskqueues there is NET_TASK_INIT() and
all drivers that were known back in 2020 we marked with it in
6c3e93cb5a4. However in e87c4940156 we took conservative approach
and preferred to opt-in rather than opt-out for the epoch.

This change not only reverts e87c4940156 but adds a safety belt to
avoid panicing with INVARIANTS if there is a missed driver. With
INVARIANTS we will run in_epoch() check, print a warning and enter
the net epoch. A driver that prints can be quickly fixed with the
IFF_NEEDSEPOCH flag, but better be augmented to properly enter the
epoch itself.

Note on TCP LRO: it is a backdoor to enter the TCP stack bypassing
some layers of net stack, ignoring either old IFF_KNOWSEPOCH or the
new IFF_NEEDSEPOCH. But the tcp_lro_flush_all() asserts the presence
of network epoch. Indeed, all NIC drivers that support LRO already
provide the epoch, either with help of INTR_TYPE_NET or just running
NET_EPOCH_ENTER() in their code.

Reviewed by: zlei, gallatin, erj
Differential Revision: https://reviews.freebsd.org/D39510


# 25c92cd2 06-Mar-2023 Justin Hibbits <jhibbits@FreeBSD.org>

iflib: Further convert to use IfAPI accessors

Summary:
When iflib was first converted some IfAPI APIs were not yet present, so
were tagged with "XXX" comments. Finish the conversion by using these
new APIs.

Reviewed by: gallatin, erj
Sponsored by: Juniper Networks, Inc
Differential Revision: https://reviews.freebsd.org/D38928


# 5f7bea29 28-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

iflib: fix regression with new pfil(9) KPI

Do not pass the pointer to our valid mbuf to pfil(9). Pass an
uninitialized one only. This was unsafe with the old KPI, too,
but for some reason didn't fail.

Fixes: caf32b260ad46b17a4c1a8ce6383e37ac489f023


# caf32b26 14-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

pfil: add pfil_mem_{in,out}() and retire pfil_run_hooks()

The 0b70e3e78b0 changed the original design of a single entry point
into pfil(9) chains providing separate functions for the filtering
points that always provide mbufs and know the direction of a flow.
The motivation was to reduce branching. The logical continuation
would be to do the same for the filtering points that always provide
a memory pointer and retire the single entry point.

o Hooks now provide two functions: one for mbufs and optional for
memory pointers.
o pfil_hook_args() has a new member and pfil_add_hook() has a
requirement to zero out uninitialized data. Bump PFIL_VERSION.
o As it was before, a hook function for a memory pointer may realloc
into an mbuf. Such mbuf would be returned via a pointer that must
be provided in argument.
o The only hook that supports memory pointers is ipfw:default-link.
It is rewritten to provide two functions.
o All remaining uses of pfil_run_hooks() are converted to
pfil_mem_in().
o Transparent union of pfil_packet_t and tricks to fix pointer
alignment are retired. Internal pfil_realloc() reduces down to
m_devget() and thus is retired, too.

Reviewed by: mjg, ocochard
Differential revision: https://reviews.freebsd.org/D37977


# 9147969b 24-Jan-2023 Przemyslaw Lewandowski <przemyslawx.lewandowski@intel.com>

iflib: Add null check to iflib_stop()

Ever since gtaskqueue_drain() was added to iflib_stop(), a kernel panic
occurs when the ice(4) driver is in recovery mode. Queues are not
initialized in this mode, so gt_taskqueue is not initialized, and
gtaskqueue_drain() will panic.

Fix this by only doing a drain if an RX queue's gt_taskqueue is
initialized.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

Reviewed by: erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D37892


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# 402810d3 20-Oct-2021 Justin Hibbits <jhibbits@FreeBSD.org>

Convert iflib(4) and iflib-based drivers to the DrvAPI

Summary:
Convert iflib(4) and the following drivers:
* axgbe
* em
* ice
* ixl
* vmxnet

Sponsored by: Juniper Networks, Inc.
Reviewed by: kbowling, #iflib
Differential Revision: https://reviews.freebsd.org/D37768


# 9c950139 17-Oct-2022 Eric Joyner <erj@FreeBSD.org>

iflib: Introduce v2 of TX Queue Select Functionality

For v2, iflib will parse packet headers before queueing a packet.

This commit also adds a new field in the structure that holds parsed
header information from packets; it stores the IP ToS/traffic class
field found in the IPv4/IPv6 header.

To help, it will only partially parse header packets before queueing
them by using a new header parsing function that does less than the
current parsing header function; for our purposes we only need up to the
minimal IP header in order to get the IP ToS infromation and don't need
to pull up more data.

For now, v1 and v2 co-exist in this patch; v1 still offers a
less-invasive method where none of the packet is parsed in iflib before
queueing.

This also bumps the sys/param.h version.

Signed-off-by: Eric Joyner <erj@FreeBSD.org>
Tested by: IntelNetworking
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D34742


# 0294e95d 21-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Fix unused variable warning in iflib.c

With clang 15, the following -Werror warning is produced:

sys/net/iflib.c:993:8: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
u_int n;
^

The 'n' variable appears to have been a debugging aid that has never
been used for anything, so remove it.

MFC after: 3 days


# d08cb453 08-Apr-2022 John Baldwin <jhb@FreeBSD.org>

iflib: Use empty inline functions for prefetch*() on non-x86.

This avoids warnings about unused variables in expressions passed to
prefetch*().


# 213e9139 29-Jul-2021 Eric Joyner <erj@FreeBSD.org>

iflib: Allow drivers to determine which queue to TX on

Adds a new function pointer to struct if_txrx in order to allow
drivers to set their own function that will determine which queue
a packet should be sent on.

Since this includes a kernel ABI change, bump the __FreeBSD_version
as well.

(This motivation behind this is to allow the driver to examine the
UP in the VLAN tag and determine which queue to TX on based on
that, in support of HW TX traffic shaping.)

Signed-off-by: Eric Joyner <erj@FreeBSD.org>

Reviewed by: kbowling@, stallamr@netapp.com
Tested by: jeffrey.e.pieper@intel.com
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D31485


# e0e12405 14-Jan-2022 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: fix LOR in iflib_netmap_register

In iflib_device_register(), the CTX_LOCK is acquired first and then
IFNET_WLOCK is acquired by ether_ifattach(). However, in netmap_hw_reg()
we do the opposite: IFNET_RLOCK is acquired first, and then CTX_LOCK
is acquired by iflib_netmap_register(). Fix this LOR issue by wrapping
the CTX_LOCK/UNLOCK calls in iflib_device_register with an additional
IFNET_WLOCK. This is safe since the IFNET_WLOCK is recursive.

MFC after: 1 month


# 618d49f5 10-Jan-2022 Alexander Motin <mav@FreeBSD.org>

Revert "iflib: Relax timer period from 0.5 to 0.5-0.75s."

I've noticed relations between iflib_timer() vs ixl_admin_timer().
Both scheduled at the same 2Hz rate, but the second is rescheduling
the first each time, so if the first get any slower, it won't be
executed at all. Revert this until deeper investigation.

This reverts commit 90bc1cf65778aafb1f226c8fe08218cfed5e40b2.


# 90bc1cf6 09-Jan-2022 Alexander Motin <mav@FreeBSD.org>

iflib: Relax timer period from 0.5 to 0.5-0.75s.

While there switch it from hardclock ticks to milliseconds.

MFC after: 2 weeks


# e2650af1 29-Dec-2021 Stefan Eßer <se@FreeBSD.org>

Make CPU_SET macros compliant with other implementations

The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451


# 4561c4f0 28-Dec-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

net: iflib: sync isc_capenable to if_capenable

On SIOCSIFCAP, some bits in ifp->if_capenable may be toggled.
When this happens, apply the same change to isc_capenable, which
is the iflib private copy of if_capenable (for a subset of the
IFCAP_* bits). In this way the iflib drivers can check the bits
using isc_capenable rather than if_capenable. This is convenient
because the latter access requires an additional indirection
through the ifp, and it is also less likely to be in cache.

PR: 260068
Reviewed by: kbowling, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33156


# 1bfdb812 19-Nov-2021 Andriy Gapon <avg@FreeBSD.org>

iflib_stop: drain rx tasks to prevent any data races

iflib_stop modifies iflib data structures that are used by _task_fn_rx,
most prominently the free lists. So, iflib_stop has to ensure that the
rx task threads are not active.

This should help to fix a crash seen when iflib_if_ioctl (e.g.,
SIOCSIFCAP) is called while there is already traffic flowing.

The crash has been seen on VMWare guests with vmxnet3 driver.

My guess is that on physical hardware the couple of 1ms delays that
iflib_stop has after disabling interrupts are enough for the queued work
to be completed before any iflib state is touched.

But on busy hypervisors the guests might not get enough CPU time to
complete the work, thus there can be a race between the taskqueue
threads and the work done to handle an ioctl, specifically in iflib_stop
and iflib_init_locked.

PR: 259458
Reviewed by: markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D32926


# 66fa12d8 18-Aug-2021 Stephan de Wit <stephan.dewt@yahoo.co.uk>

iflib: emulate counters in netmap mode

When iflib devices are in netmap mode the driver
counters are no longer updated making it look from
userspace tools that traffic has stopped.

Reported by: Franco Fichtner <franco@opnsense.org>
Reviewed by: vmaffione, iflib (erj, gallatin)
Obtained from: OPNsense
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31550


# a5688853 06-Jul-2021 Mateusz Guzik <mjg@FreeBSD.org>

iflib: use m_gethdr_raw

Reviewed by: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31081


# bad5f0b6 30-Jun-2021 Mateusz Guzik <mjg@FreeBSD.org>

iflib: switch bare zone_mbuf use to m_free_raw

Reviewed by: kbowling
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30961


# 58632fa7 19-May-2021 Marcin Wojtas <mw@FreeBSD.org>

iflib: Add a new quirk

ENETC NIC found in LS1028A has a bug where clearing TX pidx/cidx
causes the ring to hang after being re-enabled.
Add a new flag, if set iflib will preserve the indices during restart.

Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: gallatin, erj
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30728


# cd945dc0 27-Apr-2021 Marcin Wojtas <mw@FreeBSD.org>

iflib: Take iri_pad into account when processing small frames

Drivers can specify padding of received frames with iri_pad field.
This can be used to enforce ip alignment by hardware.
Iflib ignored that padding when processing small frames,
which rendered this feature inoperable.
I found it while writing a driver for a NIC that can ip align
received packets. Note that this doesn't change behavior of existing
drivers as they all set iri_pad to 0.

Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: gallatin
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30009


# ca7005f1 25-Apr-2021 Patrick Kelsey <pkelsey@FreeBSD.org>

iflib: Improve mapping of TX/RX queues to CPUs

iflib now supports mapping each (TX,RX) queue pair to the same CPU
(default), to separate CPUs, or to a pair of physical and logical CPUs
that share the same L2 cache. The mapping mechanism supports unequal
numbers of TX and RX queues, with the excess queues always being
mapped to consecutive physical CPUs. When the platform cannot
distinguish between physical and logical CPUs, all are treated as
physical CPUs. See the comment on get_cpuid_for_queue() for the
entire matrix.

The following device-specific tunables influence the mapping process:
dev.<device>.<unit>.iflib.core_offset (existing)
dev.<device>.<unit>.iflib.separate_txrx (existing)
dev.<device>.<unit>.iflib.use_logical_cores (new)

The following new, read-only sysctls provide visibility of the mapping
results:
dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu

When an iflib driver allocates TX softirqs without providing reference
RX IRQs, iflib now binds those TX softirqs to CPUs using the above
mapping mechanism (that is, treats them as if they were TX IRQs).
Previously, such bindings were left up to the grouptaskqueue code and
thus fell outside of the iflib CPU mapping strategy.

Reviewed by: kbowling
Tested by: olivier, pkelsey
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D24094


# 3183d0b6 23-Apr-2021 Andrew Gallatin <gallatin@FreeBSD.org>

iflib: initialize LRO unconditionally

Changes to the LRO code have exposed a bug in iflib where devices
which are not capable of doing LRO are still calling
tcp_lro_flush_all(), even when they have not initialized the LRO
context. This used to be mostly harmless, but the LRO code now sets
the VNET based on the ifp in the lro context and will try to access it
through a NULL ifp resulting in a panic at boot.

To fix this, we unconditionally initializes LRO so that we have a
valid LRO context when calling tcp_lro_flush_all(). One alternative is
to check the device capabilities before calling tcp_lro_flush_all() or
adding a new state flag in the ctx. However, it seems unwise to add an
extra, mostly useless test for higher performance devices when we can
just initialize LRO for all devices.

Reviewed by: erj, hselasky, markj, olivier
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29928


# 361e9501 05-Apr-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: add support for netmap offsets

Follow-up change to a6d768d845c173823785c71bb18b40074e7a8998.
This change adds iflib support for netmap offsets, enabling
applications to use offsets on any driver backed by iflib.


# 21d0c012 29-Mar-2021 you@x <you@x>

netmap: iflib: add nm_config callback

This per-driver callback is invoked by netmap when it wants
to align the number of TX/RX netmap rings and/or the number of
TX/RX netmap slots to the actual state configured in the hardware.
The alignment happens when netmap mode is switched on (with no
active netmap file descriptors for that netmap port), or when
collecting netmap port information.

MFC after: 1 week


# 09c3f04f 02-Mar-2021 Marcin Wojtas <mw@FreeBSD.org>

iflib: add support for admin completion queues

For interfaces with admin completion queues, introduce a new devmethod
IFDI_ADMIN_COMPLETION_HANDLE and a corresponding flag IFLIB_HAS_ADMINCQ.

This provides an option for handling any admin cq logic, which cannot be
run from an interrupt context.

Said method is called from within iflib's admin task, making it safe to
sleep.

Reviewed by: mmacy
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28708


# ef567155 24-Feb-2021 Marcin Wojtas <mw@FreeBSD.org>

Fix powerpc build after 6dd69f0064f1

Commit 6dd69f0064f1 ("iflib: introduce isc_dma_width")
failed to build on powerpc due to implicit type conversion
error. Fix that.

Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.


# 6dd69f00 24-Feb-2021 Marcin Wojtas <mw@FreeBSD.org>

iflib: introduce isc_dma_width

Some DMA controllers are unable to address the full host memory space
and are instead limited to a subset of address range (e.g. 48-bit).

Allow the driver to specify the maximum allowed DMA addressing width
(in bits) for the NIC hardware, by introducing a new field in
if_softc_ctx.

If said field is omitted (set to 0), the lowaddr of DMA window bounds
defaults to BUS_SPACE_MAXADDR.

Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28706


# b6999635 24-Feb-2021 Mark Johnston <markj@FreeBSD.org>

iflib: Avoid double counting in rxeof

iflib_rxeof() was counting everything twice. This was introduced when
pfil hooks were added to the iflib receive path. We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.

PR: 253583
Reviewed by: gallatin, erj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28900


# 2ccf971a 19-Feb-2021 John Baldwin <jhb@FreeBSD.org>

iflib: Cast the result of iflib_netmap_txq_init() to void.

This fixes a warning from GCC for kernels without netmap since the
return value is never used.

Reviewed by: vmaffione, erj
Differential Revision: https://reviews.freebsd.org/D28598


# 922cf8ac 14-Feb-2021 Allan Jude <allanjude@FreeBSD.org>

Use iflib_if_init_locked() during media change instead of iflib_init_locked().

iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for media changes.
iflib_if_init_locked() calls stop then init, so fixes the problem.

PR: 253473
MFC after: 3 days
Reviewed by: markj
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D28667


# 38bfc6de 01-Feb-2021 Sai Rajesh Tallamraju <stallamr@netapp.com>

iflib: Free resources in a consistent order during detach

Memory and PCI resources are freed with no particular order. This could
cause use-after-frees when detaching following a failed attach. For
instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but
iflib_tqg_detach() attempts to access this array. Similarly, adapter
queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to
access adapter queues to free PCI resources.

MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27634


# 3f43ada9 28-Jan-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.

Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.


# f80efe50 24-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: move per-packet operation out of fragments loop

MFC after: 1 week


# aceaccab 24-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: add support for NS_MOREFRAG

The NS_MOREFRAG flag can be set in a netmap slot to represent a
multi-fragment packet. Only the last fragment of a packet does
not have the flag set. On TX rings, the flag may be set by the
userspace application. The kernel will look at the flag and use it
to properly set up the NIC TX descriptors.
On RX rings, the kernel may set the flag if the packet received
was split across multiple netmap buffers. The userspace application
should look at the flag to know when the packet is complete.

Submitted by: rajesh1.kumar_amd.com
Reviewed by: vmaffione
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27799


# 0c864213 21-Jan-2021 Andrew Gallatin <gallatin@FreeBSD.org>

iflib: Fix a NULL pointer deref

rxd_frag_to_sd() have pf_rv parameter as NULL with the current
code. This patch fixes the NULL pointer dereference in that
case thus avoiding a possible panic.

Submitted by: rajesh1.kumar at amd.com
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D28115


# 55f0ad5f 10-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: restore hwofs and support it in iflib

Restore the hwofs functionality temporarily disabled by
7ba6ecf216fb15e8b147db2 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable howfs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.

PR: 252453
MFC after: 1 week


# 8aa8484c 10-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: fix build failure in case DEV_NETMAP is not defined

This addresses the build failure introduced by
3d65fd97e85ab807f3baa62.

MFC with: 3d65fd97e85ab807f3baa62


# 4ba9ad0d 10-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: add assert to prevent out-of-bounds array access

The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.

MFC after: 2 weeks


# 3d65fd97 09-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: iflib: enable/disable krings on any interface reinit

Since 1d238b07d5d4d9660ae0e0, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().

Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().

PR: 252453
MFC after: 1 week


# 3189ba61 09-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: iflib: fix asserts in netmap_fl_refill()

When netmap_fl_refill() is called at initialization time (e.g.,
during netmap_iflib_register()), nic_i must be 0, since the
free list is reinitialized. At the end of the refill cycle, nic_i
must still be zero, because exactly N descriptors (N is the ring size)
are refilled.
This patch therefore fixes the assertions to check on nic_i rather
than on nm_i. The current netmap_reset() may in fact cause nm_i
to be != 0 while the device is resetting: this may happen when
multiple non-cooperating processes open different subsets of the
available netmap rings.

PR: 252518
MFC after: 1 week


# 1d238b07 09-Jan-2021 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: iflib: stop krings during interface reset

When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.

PR: 252453
MFC after: 1 week


# 81be6552 18-Dec-2020 Matt Macy <mmacy@FreeBSD.org>

iflib: ensure that tx interrupts enabled and cleanups

Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.

This change causes the the tx taskq thread to be run
if there are completed descriptors.

While here:

- make timer interrupt delay a sysctl

- simplify txd_db_check handling

- comment on INTR types

Background on the change:

Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race to this approach that would lockup the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.

The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.

Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.

Reported by: mav@
Tested by: mav@
Reviewed by: gallatin@
MFC after: 1 month
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D27683


# c065d4e5 07-Dec-2020 Mark Johnston <markj@FreeBSD.org>

iflib: Avoid leaking the freelist bitmaps upon driver detach

Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27342


# 10254019 07-Dec-2020 Mark Johnston <markj@FreeBSD.org>

iflib: Detach tasks upon device registration failure

In some error paths we would fail to detach from the iflib taskqueue
groups. Also move the detach code into its own subroutine instead of
duplicating it.

Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27342


# 54bf96fb 11-Nov-2020 Mark Johnston <markj@FreeBSD.org>

iflib: Free full mbuf chains when draining transmit queues

Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
Reviewed by: gallatin, hselasky
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27179


# be7a6b3d 28-Oct-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: fix typo bug introduced by r367093

Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.

PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093


# 17cec474 27-Oct-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: add per-tx-queue netmap timer

The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.

This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.

This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.

Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.

PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 35d8a463 01-Sep-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: leave only 1 receive descriptor unused

The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
- Intel drivers write pidx to their RDT register, which means that
NICs will only use the descriptors up to pidx-1 (modulo ring size N),
and won't actually use the one pointed by pidx. This does not break
reception, but it is anyway confusing and suboptimal (the NIC will
actually see only N-2 descriptors as available, rather than N-1).
Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
- The semantic used by Intel (RDT is one descriptor past the last
valid one) is used by most (if not all) NICs, and it is also used
on the TX side (also in iflib). Since iflib is not currently
using this semantic for RX, it must decrement fl->ifl_pidx
(modulo N) before calling isc_rxd_flush(), and then the
per-driver callback implementation must increment the index
again (to match the real semantic). This is confusing and suboptimal.
- The iflib refill function is also called at initialization.
However, in case the ring size is smaller than 128 (e.g. if_mgb),
the refill function will actually prepare all the receive
descriptors (N), without leaving one unused, as most of NICs assume
(e.g. to avoid RDT to overrun RDH). I can speculate that the code
looks like this right now because this issue showed up during
testing (e.g. with if_mgb), and it was easy to workaround by
decrementing pidx before isc_rxd_flush().

The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
- change the semantics of the pidx argument to the usual one (that
is the index one past the last valid one), so that both iflib and
drivers avoid the decrement/increment dance.
- fix the initialization code to prepare at most N-1 descriptors.

Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26191


# ae750d5c 25-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: publish all the receive buffer

At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).

MFC after: 2 weeks


# de5b4610 24-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: fix isc_rxd_flush call in netmap_fl_refill()

The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.

MFC after: 2 weeks


# 6d84e76a 12-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: improve rxsync to support IFLIB_HAS_RXCQ

For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.

PR: 248494
MFC after: 3 weeks


# 530960be 12-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: refactor netmap_fl_refill and fix off-by-one issue

First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.

PR: 248494
MFC after: 3 weeks


# c9d886cd 06-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: drop redundant check

The validity of head is already checked by nm_rxsync_prologue().

MFC after: 2 weeks


# ee07345d 06-Aug-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: don't increment ifl_cidx on the wrong free list

Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.

PR: 248494
MFC after: 2 weeks


# 0ae0e8d2 26-Jul-2020 Matt Macy <mmacy@FreeBSD.org>

iflib: fix LOR with bpf detach

Reported by: grehan@
Approved by: grehan@
MFC after: 1 week
Sponsored by: Netgate
Differential Revision: https://reviews.freebsd.org/D25530


# ac11d857 20-Jul-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: initialize netmap with the correct number of descriptors

In case the network device has a RX or TX control queue, the correct
number of TX/RX descriptors is contained in the second entry of the
isc_ntxd (or isc_nrxd) array, rather than in the first entry.
This case is correctly handled by iflib_device_register() and
iflib_pseudo_register(), but not by iflib_netmap_attach().
If the first entry is larger than the second, this can result in a
panic. This change fixes the bug by introducing two helper functions
that also lead to some code simplification.

PR: 247647
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D25541


# b256d25c 06-Jul-2020 Mark Johnston <markj@FreeBSD.org>

iflib: Fix some nits in the rx refill code.

- Get rid of the ifl_vm_addrs array. It is not used by any existing
consumer, so we are just dirtying a couple of cache lines for no
reason.
- Use uma_zalloc(fl->ifl_zone) instead of m_cljget(). Otherwise
m_cljget() is doing unnecessary work to look up the correct zone, when
iflib already knows what that zone is.
- ifl_gen is only used when INVARIANTS is on, so make that more clear.
- Fix some style nits and inconsistencies.

Reviewed by: gallatin
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25490


# a363e1d4 06-Jul-2020 Mark Johnston <markj@FreeBSD.org>

iflib: Fix handling of mbuf cluster allocation failures.

When refilling an rx freelist, make sure we only update the hardware
producer index if at least one cluster was allocated. Otherwise the
NIC is programmed to write a previously used cluster, typically
resulting in a use-after-free when packet data is written by the
hardware.

Also make sure that we don't update the fragment index cursor if the
last allocation attempt didn't succeed. For at least Intel drivers,
iflib assumes that the consumer index and fragment index cursor stay in
lockstep, but this assumption was violated in the face of cluster
allocation failures.

Reported and tested by: pho
Reviewed by: gallatin, hselasky
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25489


# 9503233f 25-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: fix compilation issue introduced in r362621

The ifp local variable is useful even without netmap
and altq, as it is used to check for IFF_DRV_RUNNING.

MFC after: 2 weeks


# d8b2d26b 25-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: add support for partial ring openings

Reviewed by: gallatin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25254


# 88a68866 25-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: add per-tx-queue netmap support

Reviewed by: gallatin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25253


# 0ff21267 23-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: fix rsync index overrun

In the current iflib_netmap_rxsync, there is nothing that prevents
kring->nr_hwtail to overrun kring->nr_hwcur during the descriptor
import phase. This may cause errors in netmap applications, such as:

em1 RX0: fail 'head < kring->nr_hwcur || head > kring->nr_hwtail'
h 795 c 795 t 282 rh 795 rc 795 rt 282 hc 282 ht 282

Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25252


# 9aeca213 21-Jun-2020 Matt Macy <mmacy@FreeBSD.org>

iflib: fix cloneattach fail and generalize pseudo device handling

- a cloneattach failure will not currently be handled correctly,
jump to the right target

- pseudo devices are all treat as if they're ethernet devices -
this often doesn't make sense

MFC after: 1 week
Sponsored by: Netgate, Inc.
Differential Revision: https://reviews.freebsd.org/D25083


# 0a182b4c 14-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: enter/exit netmap mode after device stops

Avoid possible race conditions by calling nm_set_native_flags()
and nm_clear_native_flags() only after the device has been
stopped.

MFC after: 1 week


# e136e9c8 09-Jun-2020 Vincenzo Maffione <vmaffione@FreeBSD.org>

iflib: netmap: honor netmap_irx_irq return values

In the receive interrupt routine, always call netmap_rx_irq().
The latter function will return != NM_IRQ_PASS if netmap is not
active on that specific receive queue, so that the driver can go
on with iflib_rxeof(). Note that netmap supports partial opening,
where only a subset of the RX or TX rings can be open in netmap mode.
Checking the IFCAP_NETMAP flag is not enough to make sure that the
queue is indeed in netmap mode.
Moreover, in case netmap_rx_irq() returns NM_IRQ_RESCHED, it means
that netmap expects the driver to call netmap_rx_irq() again as soon
as possible. Currently, this may happen when the device is attached
to a VALE switch.

Reviewed by: gallatin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25167


# 1f93e931 31-May-2020 Matt Macy <mmacy@FreeBSD.org>

Fix panics when using iflib pseudo device support

Reviewed by: gallatin@, hselasky@
MFC after: 1 week
Sponsored by: Netgate, Inc.
Differential Revision: https://reviews.freebsd.org/D23710


# 814fa34d 30-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Increase the iflib txq callout mutex name length to 32 bytes.

With a length of 16, the name ("<if name>:TX(<qid>):callout") typically
gets truncated.

PR: 245712
Reported by: ghuckriede@blackberry.com
MFC after: 1 week


# 45818bf1 27-Apr-2020 Eric Joyner <erj@FreeBSD.org>

iflib: Stop interface before (un)registering VLAN

This patch is intended to solve a specific problem that iavf(4)
encounters, but what it does can be extended to solve other issues.

To summarize the iavf(4) issue, if the PF driver configures VLAN
anti-spoof, then the VF driver needs to make sure no untagged traffic is
sent if a VLAN is configured, and vice-versa. This can be an issue when
a VLAN is being registered or unregistered, e.g. when a packet may be on
the ring with a VLAN in it, but the VLANs are being unregistered. This
can cause that tagged packet to go out and cause an MDD event.

To fix this, include a new interface-dependent function that drivers can
implement named IFDI_NEEDS_RESTART(). Right now, this function is called
in iflib_vlan_unregister/register() to determine whether the interface
needs to be stopped and started when a VLAN is registered or
unregistered. The default return value of IFDI_NEEDS_RESTART() is true,
so this fixes the MDD problem that iavf(4) encounters, since the
interface rings are flushed during a stop/init.

A future change to iavf(4) will implement that function just in case the
default value changes, and to make it explicit that this interface reset
is required when a VLAN is added or removed.

Reviewed by: gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22086


# 59d50fe5 30-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Simplify taskqgroup inititialization.

taskqgroup initialization was broken into two steps:

1. allocate the taskqgroup structure, at SI_SUB_TASKQ;
2. initialize taskqueues, start taskqueue threads, enqueue "binder"
tasks to bind threads to specific CPUs, at SI_SUB_SMP.

Step 2 tries to handle the case where tasks have already been attached
to a queue, by migrating them to their intended queue. In particular,
tasks can't be enqueued before step 2 has completed. This breaks NFS
mountroot on systems using an iflib-based driver when EARLY_AP_STARTUP
is not defined, since mountroot happens before SI_SUB_SMP in this case.

Simplify initialization: do all initialization except for CPU binding at
SI_SUB_TASKQ. This means that until CPU binding is completed, group
tasks may be executed on a CPU other than that to which they were bound,
but this should not be a problem for existing users of the taskqgroup
KPIs.

Reported by: sbruno
Tested by: bdragon, sbruno
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24188


# ed6611cc 24-Mar-2020 Ed Maste <emaste@FreeBSD.org>

iflib: simplify MPASS assertion

Submitted by: andrew


# 68af0153 24-Mar-2020 Ed Maste <emaste@FreeBSD.org>

iflib: split compound assertion

ThunderX cluster systems are panicking on boot with a failed assertion
MPASS(gtask != NULL && gtask->gt_taskqueue != NULL). Split the
assertion so that it's clear which part is failing.


# 87699691 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Remove extraneous code from iflib

ifsd_cidx is never used, and the line removed from rxd_frag_to_sd() is
just dead code.

Reviewed by: erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23951


# 3caff188 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Remove refill budget from iflib

Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23948


# b3813609 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Allow iflib drivers to specify the buffer size used for each receive queue

Reviewed by: erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23947


# e5030490 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Remove freelist contiguous-indexes assertion from rxd_frag_to_sd()

The vmx driver is an example of an iflib driver that might report
packets using non-contiguous descriptors (with unused descriptors
either between received packets or between the fragments of a received
packet), so this assertion needs to be removed.

For such drivers, the freelist producer and consumer indexes don't
relate directly to driver ring slots (the driver deals directly with
freelist buffer indexes supplied by iflib during refill, and reports
them with each fragment during packet reception), but do continue to
be used by iflib for accounting, such as determining the number of
ring slots that are refillable.

PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23946


# 4f2beb72 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix iflib zero-length fragment handling

The dmamap for zero-length fragments should not be unloaded, as doing
so breaks the the cluster-reuse logic in _iflib_fl_refill().

All zero-length fragments are now handled by the assemble_segments()
path so that the cluster-reuse logic there does not have to be
replicated in the small-single-fragment-packet path of
iflib_rxd_pkt_get().

Packets consisting entirely of zero-length fragments (which result in
a NULL mbuf pointer) are now properly tolerated. This allows drivers
(such as the vmx driver) to pass such packets to iflib when a
descriptor error occurs during packet reception, the advantage being
that the refill of descriptors associated with the error packet are
handled via the existing iflib machinery without having to duplicate
parts of that machinery in the driver to handle that error case.

Reviewed by: avg, erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23945


# 9e9b738a 14-Mar-2020 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix iflib freelist state corruption

This fixes a bug in iflib freelist management that breaks the required
correspondence between freelist indexes and driver ring slots.

PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: avg, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23943


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# e87c4940 24-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Although most of the NIC drivers are epoch ready, due to peer pressure
switch over to opt-in instead of opt-out for epoch.

Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.

Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.

Mark several tested drivers as IFF_KNOWSEPOCH.

Reviewed by: hselasky, jeff, bz, gallatin
Differential Revision: https://reviews.freebsd.org/D23674


# f98977b5 12-Feb-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Use NET_TASK_INIT() and NET_GROUPTASK_INIT() for drivers that process
incoming packets in taskqueue context.

This patch extends r357772.

Tested by: yp@mm.st
Sponsored by: Mellanox Technologies


# fb1a29b4 12-Feb-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Make sure the so-called end of receive interrupts don't starve in iflib.

When the receive ring cannot be filled with mbufs, due to lack of memory,
no more interrupts may be generated to fill the receive ring later on.
Make sure to have a watchdog, to try refilling the receive ring from time
to time, hopefully when more mbufs are available.

Differential Revision: https://reviews.freebsd.org/D23315
MFC after: 1 week
Reviewed by: gallatin@
Sponsored by: Mellanox Technologies


# 6c3e93cb 11-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Use NET_TASK_INIT() and NET_GROUPTASK_INIT() for drivers that process
incoming packets in taskqueue context.

Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D23518


# 0b8df657 22-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Enter network epoch in iflib rxeof task.

In upcoming changes ether_input() is going to be changed not
to enter the network epoch. It is going to be responsibility
of network interrupt. In case of iflib - its taskqueue.


# f6afed72 02-Jan-2020 Eric Joyner <erj@FreeBSD.org>

iflib: Prevent watchdog from resetting idle queues

While changing link state in iflib_link_state_change(), queues are
marked as IFLIB_QUEUE_IDLE to disable watchdog. Currently, iflib_timer()
watchdog does not check for previous queue status before marking it as
IFLIB_QUEUE_HUNG.

This patch adds check of queue status before marking it as hung.

Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>

PR: 239240
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reported by: ultima@
Reviewed by: gallatin@, erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21712


# db8e8f1e 04-Nov-2019 Eric Joyner <erj@FreeBSD.org>

iflib: properly release memory allocated for DMA

DMA memory allocations using the bus_dma.h interface are not properly
released in all cases for both Tx and Rx. This causes ~448 bytes of
M_DEVBUF allocations to be leaked.

First, the DMA maps for Rx are not properly destroyed. A slight attempt
is made in iflib_fl_bufs_free to destroy the maps if we're detaching.
However, this function may not be reliably called during detach. Indeed,
there is a comment "asking" if this should be moved out.

Fix this by moving the bus_dmamap_destroy call into iflib_rx_sds_free,
where we already sync and unload the DMA.

Second, the DMA tag associated with the ifr_ifdi descriptor DMA is not
released properly anywhere. Add a call to iflib_dma_free in
iflib_rx_structures_free.

Third, use of NULL as a canary value on the map pointer returned by
bus_dmamap_create is not valid. On some platforms, notably x86, this
value may be NULL. In this case, we fail to properly release the related
resources.

Remove the NULL checks on map values in both iflib_fl_bufs_free and
iflib_txsd_destroy.

With all of these fixes applied, the leaks to M_DEVBUF are squelched,
and iflib drivers now seem to properly cleanup when detaching.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22203


# 244e7cff 30-Oct-2019 Eric Joyner <erj@FreeBSD.org>

iflib: cleanup memory leaks on driver detach

From Jake:
The iflib stack failed to release all of the memory allocated under
M_IFLIB during device detach.

Specifically, the ifmp_ring, the ift_ifdi Tx DMA info, and the ifr_ifdi Rx
DMA info were not being released.

Release this memory so that iflib won't leak memory when a device
detaches.

Since we're freeing the ift_ifdi pointer during iflib_txq_destroy we
need to call this only after iflib_dma_free in iflib_tx_structures_free.

Additionally, also ensure that we destroy the callout mutex associated
with each Tx queue when we free it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22157


# 1558015e 23-Oct-2019 Eric Joyner <erj@FreeBSD.org>

iflib: call ether_ifdetach and netmap_detach before stop

From Jake:
Calling ether_ifdetach after iflib_stop leads to a potential race where
a stale ifp pointer can remain in the route entry list for IPv6 traffic.
This will potentially cause a page fault or other system instability if
the ifp pointer is accessed.

Move both iflib_netmap_detach and ether_ifdetach to be called prior to
iflib_stop. This avoids the race above, and helps ensure that other ifp
references are removed before stopping the interface.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22071


# 7790c8c1 17-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

Split out a more generic debugnet(4) from netdump(4)

Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421


# 41669133 30-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Add IFLIB_SINGLE_IRQ_RX_ONLY.

As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.

PR: 239118
Reported and tested by: Juraj Lutter <otis@sk.freebsd.org>
Obtained from: marius
Reviewed by: gallatin
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21831


# 6554362c 27-Sep-2019 Andrew Gallatin <gallatin@FreeBSD.org>

kTLS support for TLS 1.3

TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.

Reviewed by: jhb, hselasky
Sponsored by: Netflix
Differential Revision: D21801


# 53b5b9b0 24-Sep-2019 Eric Joyner <erj@FreeBSD.org>

iflib: Remove redundant VLAN events deregistration

From Piotr:
r351152 introduced iflib_deregister() function calling
EVENTHANDLER_DEREGISTER() to unregister VLAN events. This patch removes
duplicate of EVENTHANDLER_DEREGISTER() calls placed in
iflib_device_deregister() as this function is now calling
iflib_deregister(). This is to avoid deregistering same event twice.

This patch also adds check in iflib_vlan_register() to prevent
registering VLAN while being in detach.

Patch co-authored by Krzysztof Galazka <krzysztof.galazka@intel.com>,
erj <erj@FreeBSD.org> and Jacob Keller <jacob.e.keller@intel.com>.

Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>

Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by: gallatin@, erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21711


# 56614414 16-Aug-2019 Eric Joyner <erj@FreeBSD.org>

iflib: add iflib_deregister to help cleanup on exit

Commit message by Jake:
The iflib_register function exists to allocate and setup some common
structures used by both iflib_device_register and iflib_pseudo_register.

There is no associated cleanup function used to undo the steps taken in
this function.

Both iflib_device_deregister and iflib_pseudo_deregister have some of
the necessary steps scattered in their flow. However, most of the
necessary cleanup is not done during the error path of
iflib_device_register and iflib_pseudo_register.

Some examples of missed cleanup include:

the ifp pointer is not free'd during error cleanup
the STATE and CTX locks are not destroyed during error cleanup
the vlan event handlers are not removed during error cleanup
media added to the ifmedia structure is not removed
the kobject reference is never deleted
Additionally, when initializing the kobject class reference counter is
increased even though kobj_init already increases it. This results in
the class never being free'd again because the reference count would
never hit zero even after all driver instances are unloaded.

To aid in proper cleanup, implement an iflib_deregister function that
goes through the reverse steps taken by iflib_register.

Call this function during the error cleanup for iflib_device_register
and iflib_pseudo_register. Additionally call the function in the
iflib_device_deregister and iflib_pseudo_deregister functions near the
end of their flow. This helps reduce code duplication and ensures that
proper steps are taken to cleanup allocations and references in both the
regular and error cleanup flows.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21005


# 197c6798 01-Aug-2019 Eric Joyner <erj@FreeBSD.org>

iflib: Prevent kernel panic caused by loading driver with a specific interrupt configuration

If a device has only 1 MSI-X interrupt available and does not support either
MSI or legacy interrupts, iflib_device_register() will fail, leak memory and
MSI resources, and the driver will not load. Worse, if another iflib-using
driver tries to unload afterwards, a kernel panic will occur because the
previous failed iflib driver loead did not properly call "taskqgroup_detach()"
during it's cleanup.

This patch is band-aid for this situation -- don't try allocating MSI or legacy
interrupts if a single MSI-X interrupt was allocated, but fail to load instead.
As well, during the cleanup, properly call taskqgroup_detach() on the admin
task to prevent panics when other iflib drivers unload.

This whole interrupt allocation process actually needs re-doing to properly
support devices with only a single MSI-X interrupt, devices that only support
MSI-X, non-PCI devices, and multiple non-MSIX interrupts, as well.

Signed-off-by: Eric Joyner <erj@freebsd.org>

Reviewed by: marius@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D20747


# 6a3f243b 01-Aug-2019 Eric Joyner <erj@FreeBSD.org>

iflib: remove kobject class reference increment

Commit message from Jake:
In iflib_register, the context is initialized as a kobject using the
device driver's "driver" kobject class. As part of this, the function
mistakenly increments the ref counter.

The ref counter is incremented twice, once in the code directly, and
once again by kobj_class_compile. However, there is no associated
decrement in the detach path. Because of this, the ref counter will
never go back down to zero, and thus the kobject method table will never
be released.

Remove this unnecessary reference count increment.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: jhb@, erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21125


# 7f3f6aad 24-Jul-2019 Eric Joyner <erj@FreeBSD.org>

iflib: fix dangling device softc pointer

Commit text by Jake:
If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register
function will free the ctx pointer. However, it does not reset the
device softc pointer to NULL.

This will result in memory corruption as a future access to the now
invalid pointer will corrupt memory that is later allocated on top of
the same memory location.

The iflib_device_deregister function correctly resets the softc pointer
by using device_set_softc().

This clears up the invalid dangling pointer and prevents memory
corruption that could lead to a panic or undefined behavior if the
device's driver failed to attach.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21003


# c2c5d1e7 26-Jun-2019 Marius Strobl <marius@FreeBSD.org>

o In iflib_txq_drain():
- Remove desc_used, which is only ever written to.
- Remove a dead store to reclaimed.
- Don't recycle avail.
- Sort variables according to style(9).
These changes will make a subsequent commit easier to read.
o In iflib_tx_credits_update(), don't bother checking whether the
ift_txd_credits_update method pointer is NULL; _iflib_pre_assert()
asserts upfront that this method has been assigned and functions
like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}()
and _task_fn_tx() were already unconditionally relying on the
method being callable.


# 188adcb7 19-Jun-2019 Marko Zec <zec@FreeBSD.org>

V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h /
ip_var.h since at least 2008, so make use of those definitions here.

MFC after: 3 days


# 6aee0bfa 19-Jun-2019 Marko Zec <zec@FreeBSD.org>

Evaluating htons() at compile time is more efficient than doing ntohs()
at runtime. This change removes a dependency on a barrel shifter pass
before branch resolution, while reducing the instruction stream size
by 9 bytes on amd64.

MFC after: 3 days


# d49e83ea 15-Jun-2019 Marius Strobl <marius@FreeBSD.org>

- Replace unused and only ever written to members of public iflib(9)
structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES
etc. are also only ever used for these write-only members if at all,
so both these macros and members can just go). Using these spares
may render it possible to merge certain iflib(9) fixes to stable/12.
Otherwise, changes extending struct if_irq or struct if_shared_ctx
in any way would break KBI as instances of these are allocated by
the driver front-ends (by contrast, struct if_pkt_info as well as
struct if_softc_ctx instances are provided by iflib(9) and, thus,
may grow at least at the end without breaking KBI).
- Make the pvi_name in struct pci_vendor_info const char * as device
identifiers in hardware lookup tables aren't to be expected to ever
change at runtime.
- Similarly, make the pci_vendor_info_t of struct if_shared_ctx which
is used to point to the struct pci_vendor_info arrays provided by
the driver front-ends const.
- Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating
ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only
consuming the latter macro.
- Make the name argument of iflib_io_tqg_attach(9) const, matching
the taskqgroup_attach_cpu(9) this function wraps as well as e. g.
iflib_config_gtask_init(9).
- Remove the orphaned iflib_qset_lock_get() prototype.
- Remove some extraneous empty lines.


# 668d6dbb 29-May-2019 Eric Joyner <erj@FreeBSD.org>

iflib: provide probe wrapper for vendor drivers

From Jake:
Vendor drivers that exist out-of-tree generally should return
BUS_PROBE_VENDOR from their device probe functions. This helps ensure
that a vendor replacement driver will supersede the in-kernel driver for
a given device.

Currently, if a vendor wants to implement a driver based on iflib, it
will always report BUS_PROBE_DEFAULT.

Add a wrapper function, iflib_device_probe_vendor() which can be used in
place of iflib_device_probe(). This function will just return
BUS_PROBE_VENDOR whenever iflib_device_probe() would return
BUS_PROBE_DEFAULT.

While vendor drivers can already implement such a wrapper themselves,
providing it in the iflib.h header makes it easier for the vendor driver
to do the right thing.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, marius@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D20221


# afb77372 09-May-2019 Eric Joyner <erj@FreeBSD.org>

iflib: use default ntxd and nrxd when user value is not power of 2

From Jake:
A user may set a sysctl to override the default number of Tx or Rx
descriptors. However, certain calculations in the iflib core expect the
number of descriptors to be a power of 2.

Update _iflib_assert to verify that all of the shared context parameters
for the number of descriptors are powers of 2.

Modify iflib_reset_qvalues to check that the provided isc_nrxd value is
a power of 2. If it's not, print a warning message and then use the
default value.

An alternative might be to try rounding the number down instead.
However, this creates problems in case the rounded down value is below
the minimum value that the driver would support.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: marius@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19880


# 007b804f 08-May-2019 Marius Strobl <marius@FreeBSD.org>

Allow to build without INET and INET6 again after r347221.

Submitted by: cam


# 3d10e9ed 07-May-2019 Marius Strobl <marius@FreeBSD.org>

o Use iflib_fast_intr_rxtx() also for "legacy" interrupts, i. e. INTx and
MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue
_task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP
TX throughput of EM-class devices on par with the MSI-X case and, thus,
close to wirespeed/pre-iflib(4) times again. [1]
Note that independently of the interrupt type, the UDP performance with
these MACs still is abysmal and nowhere near to where it was before the
conversion of em(4) to iflib(4).
o In iflib_init_locked(), announce which free list failed to set up.
o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead
of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt
as the latter is valid with MSI-X only.
o Instead of adding the missing - and apparently convoluted enough that a
DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for
ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to
iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and
make the checks fail gracefully rather than panic. This avoids invoking
the checks at runtime over and over again in iflib_fast_intr_rxtx() and
_task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes
these functions more readable.
o In iflib_rx_structures_setup(), only initialize LRO resources if device
and driver have LRO capability in order to not waste memory. Also, free
the LRO resources again if setting them up fails for one of the queues.
However, don't bother invoking iflib_rx_sds_free() in that case because
iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and
iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case
of failure via iflib_rx_structures_free(), but there definitely is some
asymmetry left to be fixed, though).
o Similarly, free LRO resources again in iflib_rx_structures_free().
o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully
instead of panicing (but only in case of INVARIANTS). This is a follow-
up to r344132, as such driver bugs shouldn't be fatal.
o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic()
gracefully, too.
o Bring yet more sanity to iflib_msix_init():
- If the device doesn't provide enough MSI-X vectors or not all vectors
can be allocate so the expected number of queues in addition to admin
interrupts can't be supported, try MSI next (and then INTx) as proper
MSI-X vector distribution can't be assured in such cases. In essence,
this change brings r254008 forward to iflib(4). Also, this is the fix
alluded to in the commit message of r343934.
- If the MSI-X allocation has failed, don't prematurely announce MSI is
going to be used as the latter in fact may not be available either.
- When falling back to MSI, only release the MSI-X table resource again
if it was allocated in iflib_msix_init(), i. e. isn't supplied by the
driver, in the first place.
o In mp_ndesc_handler(), handle unknown type arguments gracefully, too.

PR: 235031 (likely) [1]
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D20175


# 1722eeac 06-May-2019 Marius Strobl <marius@FreeBSD.org>

- Remove the unused ifc_link_irq and ifc_mtx_name members of struct iflib_ctx.
- Remove the only ever written to ift_db_mtx_name member of struct iflib_txq.
- Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen
and ifr_lro_enabled members of struct iflib_rxq.
- Consistently spell DMA, RX and TX uppercase in comments, messages etc.
instead of mixing with some lowercase variants.
- Consistently use if_t instead of a mix of if_t and struct ifnet pointers.
- Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and
iflib_fl_setup() in line with reality.
- Judging problem reports, people are wondering what on earth messages like:
"TX(0) desc avail = 1024, pidx = 0"
are trying to indicate. Thus, extend this string to be more like that of
non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due
to which the interface will be reset.
- Take advantage of the M_HAS_VLANTAG macro.
- Use false/true rather than FALSE/TRUE for variables of type bool.
- Use FALLTHROUGH as advocated by style(9).


# e2621d96 03-May-2019 Matt Macy <mmacy@FreeBSD.org>

Allow iflib drivers to pass a pointer to their own ifmedia structure.

Tested by: emaste@

Differential Revision: https://reviews.freebsd.org/D19946


# ce3da455 02-May-2019 Ed Maste <emaste@FreeBSD.org>

iflib: remove assertion that isc_capabilities is nonzero

It's atypical, but not invalid, for a driver to pass no capabilities.

Submitted by: Gerald Aryeetey <aryeeteygerald_rogers.com>
Reviewed by: shurd
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20142


# f154ece0 25-Apr-2019 Stephen Hurd <shurd@FreeBSD.org>

iflib: Better control over queue core assignment

By default, cores are now assigned to queues in a sequential
manner rather than all NICs starting at the first core. On a four-core
system with two NICs each using two queue pairs, the nic:queue -> core
mapping has changed from this:

0:0 -> 0, 0:1 -> 1
1:0 -> 0, 1:1 -> 1

To this:

0:0 -> 0, 0:1 -> 1
1:0 -> 2, 1:1 -> 3

Additionally, a device can now be configured to use separate cores for TX
and RX queues.

Two new tunables have been added, dev.X.Y.iflib.separate_txrx and
dev.X.Y.iflib.core_offset. If core_offset is set, the NIC is not part
of the auto-assigned sequence.

Reviewed by: marius
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D20029


# 6d49b41e 24-Apr-2019 Andrew Gallatin <gallatin@FreeBSD.org>

iflib: Add pfil hooks

As with mlx5en, the idea is to drop unwanted traffic as early
in receive as possible, before mbufs are allocated and anything
is passed up the stack. This can save considerable CPU time
when a machine is under a flooding style DOS attack.

The major change here is to remove the unneeded abstraction where
callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and
are responsible for NULL'ing that mbuf themselves. Now this happens
directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us
to use the decision (and potentially mbuf) returned by the pfil
hooks. The driver can now recycle mbufs to avoid re-allocation when
packets are dropped.

Reviewed by: marius (shurd and erj also provided feedback)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19645


# 1fd8c72c 17-Apr-2019 Kyle Evans <kevans@FreeBSD.org>

iflib: Use new ether_gen_addr, restricting addresses to that subset

Differential Revision: https://reviews.freebsd.org/D19587


# 225eae1b 28-Mar-2019 Eric Joyner <erj@FreeBSD.org>

iflib: return ENETDOWN when the network device is down

From Jake:
iflib_if_transmit returns ENOBUFS when the device is down, or when the
link isn't active.

This was changed in r308792 from return (0), so that the function
correctly reports an error that it was unable to transmit.

However, using ENOBUFS can cause some network applications to produce
the following or similar errors:

"ping: sendto: No buffer space available"

This is a bit confusing as the real cause of the issue is that the
network device is down.

Replace the ENOBUFS return with ENETDOWN to indicate more clearly that
the reason for the failure to send is due to the network device is
offline.

This will cause the error message to be reported as

"ping: sendto: Network is down"

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, sbruno@, bz@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19652


# aac9c817 28-Mar-2019 Eric Joyner <erj@FreeBSD.org>

iflib: hold the CTX lock in iflib_pseudo_register

From Jake:
The iflib_device_register function takes the CTX lock before calling
IFDI_ATTACH_PRE, and releases it upon finishing the registration.

Mirror this process in iflib_pseudo_register, so that we always hold the
CTX lock during the attach process when registering a pseudo interface
or a regular interface.

This was caught by code inspection while attempting to analyze where the
CTX lock was held.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19604


# 10a1e981 19-Mar-2019 Eric Joyner <erj@FreeBSD.org>

iflib: mark isc_driver_version as constant

From Jake:
The iflib core never modifies the isc_driver_version string. Allow
drivers to safely assign pointers to constant buffers by marking this
parameter const.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19577


# 1b9d9394 19-Mar-2019 Eric Joyner <erj@FreeBSD.org>

iflib: expose the Rx mbuf buffer size to drivers

From Jake:
iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based
on the isc_max_frame_size value that drivers setup. This calculation is
repeated by drivers when programming their hardware with the size of
each Rx buffer.

This can lead to a mismatch where the iflib mbuf size is different from
the expected size of the buffer as programmed by the hardware. This can
lead to unexpected results.

If iflib ever wants to support mbuf sizes larger than one page, every
driver must be updated to account for the new possible buffer sizes.

Fix this by calculating the mbuf size prior to calling IFDI_INIT, and
adding the iflib_get_rx_mbuf_sz function which will expose this value to
drivers, so that they do not repeat the same calculation.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19489


# 3e8d1bae 19-Mar-2019 Eric Joyner <erj@FreeBSD.org>

iflib: prevent possible infinite loop in iflib_encap

From Jake:
iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an
m_collapse and an m_defrag are attempted to shrink the mbuf cluster to
fit within the DMA segment limitations.

However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns
EFBIG on the now defragmented mbuf, we will continuously re-call
bus_dmamap_load_mbuf_sg over and over.

This happens because m_head isn't NULL, and remap is >1, so we don't try
to m_collapse or m_defrag again. The only way we exit the loop is if
m_head is NULL. However, m_head can't be modified by the call to
bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer.

I believe this will be an incredibly rare occurrence, because it is
unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second
defragment with an EFBIG error. However, it still seems like
a possibility that we should account for.

Fix the exit check to ensure that if remap is >1, we will also exit,
even if m_head is not NULL.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19468


# bc408c7d 05-Mar-2019 Eric Joyner <erj@FreeBSD.org>

Remove references to CONTIGMALLOC_WORKS in iflib and em

From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.

This same code pattern was used in if_em.c, likely trying to match
what iflib uses.

Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c

Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19199


# ca62461b 15-Feb-2019 Stephen Hurd <shurd@FreeBSD.org>

iflib: Improve return values of interrupt handlers.

iflib was returning FILTER_HANDLED, in cases where FILTER_STRAY was more
correct. This potentially caused issues with shared legacy interrupts.

Driver filters returning FILTER_STRAY are now properly handled.

Submitted by: Augustin Cavalier <waddlesplash@gmail.com>
Reviewed by: marius, gallatin
Obtained from: Haiku (a84bb9, 4947d1)
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D19201


# a6611c93 12-Feb-2019 Marius Strobl <marius@FreeBSD.org>

Fix the build with ALTQ after r344060.


# f855ec81 12-Feb-2019 Marius Strobl <marius@FreeBSD.org>

Make taskqgroup_attach{,_cpu}(9) work across architectures

So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.

While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.

Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139


# 95dcf343 12-Feb-2019 Marius Strobl <marius@FreeBSD.org>

Further correct and optimize the bus_dma(9) usage of iflib(4):
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.

Differential Revision: https://reviews.freebsd.org/D19081


# bfce461e 04-Feb-2019 Marius Strobl <marius@FreeBSD.org>

o As illustrated by e. g. figure 7-14 of the Intel 82599 10 GbE
controller datasheet revision 3.3, in the context of Ethernet
MACs the control data describing the packet buffers typically
are named "descriptors". Each of these descriptors references
one buffer, multiple of which a packet can be composed of.
By contrast, in comments, messages and the names of structure
members, iflib(4) refers to DMA resources employed for RX and
TX buffers (rather than control data) as "desc(riptors)".
This odd naming convention of iflib(4) made reviewing r343085
and identifying wrong and missing bus_dmamap_sync(9) calls in
particular way harder than it already is. This convention may
also explain why the netmap(4) part of iflib(4) pairs the DMA
tags for control data with DMA maps of buffers and vice versa
in calls to bus_dma(9) functions.
Therefore, change iflib(4) to refer to buf(fers) when buffers
and not the usual understanding of descriptors is meant. This
change does not include corrections to the DMA resources used
in the netmap(4) parts. However, it revises error messages to
state which kind of allocation/creation failed. Specifically,
the "Unable to allocate tx_buffer (map) memory" copy & pasted
inappropriately on several occasions was replaced with proper
messages.
o Enhance some other error messages to indicate which half - RX
or TX - they apply to instead of using identical text in both
cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
- Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
- change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
return values are already checked, deferred DMA allocations
not being an option at this point, BUS_DMA_NOWAIT has to be
used anyway and prior malloc(9) calls in this function also
specify M_NOWAIT.

Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D19067


# b97de13a 30-Jan-2019 Marius Strobl <marius@FreeBSD.org>

- Stop iflib(4) from leaking MSI messages on detachment by calling
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
on the corresponding resources instead of using the RIDs initially
passed to bus_alloc_resource_any(9) as the latter function may
change those RIDs. Solely em(4) for the ioport resource (but not
others) and bnxt(4) were using the correct RIDs by caching the ones
returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
> 0. Otherwise the "Unable to map MSIX table " message triggers for
devices that simply don't support MSI-X and the user may think that
something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
and em(4) during attachment under bootverbose. The non-verbose
output of em(4) seen during attachment now is close to the one
prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
drivers and correct a few others.

Reviewed by: erj, Jacob Keller, shurd
Differential Revision: https://reviews.freebsd.org/D18980


# 3db348b5 26-Jan-2019 Marius Strobl <marius@FreeBSD.org>

- In _iflib_fl_refill(), don't mark an RX buffer as available in the
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().

While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).

- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).

Reported and tested by: pho


# 77102fd6 25-Jan-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Fix an iflib driver unload panic introduced in r343085

The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.

Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix


# 8f82136a 21-Jan-2019 Patrick Kelsey <pkelsey@FreeBSD.org>

onvert vmx(4) to being an iflib driver.

Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.

Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.

Reviewed by: shurd
MFC after: 1 week
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18761


# 7f3eb9da 21-Jan-2019 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix various resource leaks that can occur in the error paths of
iflib_device_register() and iflib_pseudo_register().

Reviewed by: shurd
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18760


# 8a04b53d 15-Jan-2019 Konstantin Belousov <kib@FreeBSD.org>

Improve iflib busdma(9) KPI use.

- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations. Create
separate tso maps for tx descriptors. In particular, do not use
non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
Submitted by: marius.

Reported and tested by: pho
Reviewed by: marius
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cd28ea92 07-Jan-2019 Stephen Hurd <shurd@FreeBSD.org>

Use iflib_if_init_locked() during resume instead of iflib_init_locked().

iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for suspend. iflib_if_init_locked() calls stop then init,
so fixes the problem.

This was causing errors after a resume from suspend.

PR: 224059
Reported by: zeising
MFC after: 1 week
Sponsored by: Limelight Networks


# 85f3b801 02-Jan-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix typo, use boolean operator instead of bit-wise.

Reviewed by: marius, shurd
MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# 7124b5ba 11-Dec-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix !tx_abdicate error from r336560

r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.

Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.

Reported by: lev
MFC after: 1 week
Sponsored by: Limelight Networks


# fbec776d 27-Nov-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Use busdma unconditionally in iflib

- Remove the complex mechanism to choose between using busdma
and raw pmap_kextract at runtime. The reduced complexity makes
the code easier to read and maintain.

- Fix a bug in the small packet receive path where clusters were
repeatedly mapped but never unmapped. We now store the cluster's
bus address and avoid re-mapping the cluster each time a small
packet is received.

This patch fixes bugs I've seen where ixl(4) will not even
respond to ping without seeing DMAR faults.

I see a small improvement (14%) on packet forwarding tests using
a Haswell based Xeon E5-2697 v3. Olivier sees a small
regression (-3% to -6%) with lower end hardware.

Reviewed by: mmacy
Not objected to by: sbruno
MFC after: 8 weeks
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17901


# 0efb1a46 14-Nov-2018 Stephen Hurd <shurd@FreeBSD.org>

Clear RX completion queue state veriables in iflib_stop()

iflib_stop() was not resetting the rxq completion queue state variables.
This meant that for any driver that has receive completion queues, after a
reinit, iflib would start asking what's available on the rx side starting at
whatever the completion queue index was prior to the stop, instead of at 0.

Submitted by: pkelsey
Reported by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks


# 8d4ceb9c 14-Nov-2018 Stephen Hurd <shurd@FreeBSD.org>

Prevent POLA violation with TSO/CSUM offload

Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding
CSUM_IP6?_TCP / CSUM_IP flags are also set.

Rather than requireing drivers to bake-in an understanding that TSO implies
checksum offloads, make it explicit.

This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to
ensure it's zeroed for TSO.

Reported by: Jacob Keller <jacob.e.keller@intel.com>
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17801


# 4d261ce2 14-Nov-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix leaks caused by ifc_nhwtxqs never being initialized

r333502 removed initialization of ifc_nhwtxqs, and it's not clear
there's a need to copy it into the struct iflib_ctx at all. Use
ctx->ifc_sctx->isc_ntxqs instead.

Further, iflib_stop() did not clear the last ring in the case where
isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use
ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl.

Reported by: pkelsey
Reviewed by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17979


# a42546df 07-Nov-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix rxcsum issue introduced in r338838

r338838 attempted to fix issues with rxcsum and rxcsum6.
However, the rxcsum bits were set as though if_setcapenablebit() was
being called, not if_togglecapenable() which is in use. As a result,
it was not possible to disable rxcsum when rxcsum6 was supported.

PR: 233004
Reported by: lev
Reviewed by: lev
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17881


# 46fa0c25 23-Oct-2018 Eric Joyner <erj@FreeBSD.org>

Revert r339634.

That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.

Reported by: ae@, pho@, et al
Sponsored by: Intel Corporation


# 940f62d6 22-Oct-2018 Eric Joyner <erj@FreeBSD.org>

iflib: drain enqueued tasks before detaching from taskqgroup

The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.

The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.

As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.

Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: shurd@, erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D17404


# 77c1fcec 12-Oct-2018 Eric Joyner <erj@FreeBSD.org>

ixl/iavf(4): Change ixlv to iavf and update it to use iflib(9)

Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).

This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.

The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.

A man page update that documents these drivers is forthcoming in a separate
commit.

Reviewed by: sbruno@, kbowling@
Tested by: jeffrey.e.pieper@intel.com
Approved by: re (gjb@)
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429


# 0c919c23 20-Sep-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix capabilities handling for iflib drivers

Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:

IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported

It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.

IFCAP_VLAN_HWCSUM could not be modified via the ioctl()

Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.

Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.

Interfaces were being unnecessarily stopped and restarted for WoL

PR: 231151
Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: galladin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17158


# 64e6fc13 06-Sep-2018 Stephen Hurd <shurd@FreeBSD.org>

Clean up iflib sysctls

Remove sysctls:
txq_drain_encapfail - now a duplicate of encap_txd_encap_fail
intr_link - was never incremented
intr_msix - was never incremented
rx_zero_len - was never incremented

The following were not incremented in all code-paths that apply:
m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees,
encap_txd_encap_fail.

Fixes:
Replace the broken collapse_pkthdr() implementation with an MPASS().
fl_refills and fl_refills_large were not incremented when using netmap.

Reviewed by: gallatin
Approved by: re (marius)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16733


# bc0e855b 29-Aug-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix compile error due to missing parenthesis in r338372

Approved by: re (gjb)


# a520f8b6 29-Aug-2018 Stephen Hurd <shurd@FreeBSD.org>

Fix potential data corruption in iflib

The MP ring may have txq pointers enqueued. Previously, these were
passed to m_free() when IFC_QFLUSH was set. This patch checks for
the value and doesn't call m_free().

Reviewed by: gallatin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16882


# 8f410865 03-Aug-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Mark the send queue ready so ALTQ is available.


# b8ca4756 25-Jul-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

ALTQ support for iflib.

Reviewed by: jmallett, mmacy
Differential Revision: https://reviews.freebsd.org/D16433


# c9a49a4f 24-Jul-2018 Marius Strobl <marius@FreeBSD.org>

Since r336611, n is only used for INET in iflib_parse_header().
Reported by: rpokala


# 7474544b 22-Jul-2018 Marius Strobl <marius@FreeBSD.org>

Use the maximum of isc_tx_{nsegments,tso_segments_max} for MAX_TX_DESC.
Since r336313, TSO support for LEM-class devices is removed again as it
was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a
result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus,
inappropriate watermarks were used for this class.

This is really only a band-aid, though, because so far iflib(9) doesn't
fully take into account that DMA engines can support different maxima
of segments for transfers of TSO and non-TSO packets. For example, the
DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC
used isc_tx_tso_segments_max only. For most in-tree consumers that
doesn't make a difference as the maxima are the same for both kinds of
transfers (that is, apart from the fact that TSO may require up to 2
sentinel descriptors but also not with every MAC supported). However,
isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default
with ixl(4).


# 8b8d9093 22-Jul-2018 Marius Strobl <marius@FreeBSD.org>

- Given that the controlling expression of the receive loop in iflib_rxeof()
tests for avail > 0, avail can never be 0 within that loop. Thus, move
decrementing avail and budget_left into the loop and before the code which
checks for additional descriptors having become available in case all the
previous ones have been processed but there still is budget left so the
latter code works as expected. [1]
- In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m
and n respectively. [2, 3]
- In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing
it. [4]
- Remove a duplicate assignment of segs in iflib_encap().

Reported by: Coverity
CID: 1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]


# fe51d4cd 20-Jul-2018 Stephen Hurd <shurd@FreeBSD.org>

Add knob to control tx ring abdication.

r323954 changed the mp ring behaviour when 64-bit atomics were
available to abdicate the TX ring rather than having one become a
consumer thereby running to completion on TX. The consumer of the mp
ring was then triggered in the tx task rather than blocking the TX call.
While this significantly lowered the number of RX drops in small-packet
forwarding, it also negatively impacts TX performance.

With this change, the default behaviour is reverted, causing one TX ring
to become a consumer during the enqueue call. A new sysctl,
dev.X.Y.iflib.tx_abdicate is added to control this behaviour.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16302


# dd7fbcf1 20-Jul-2018 Stephen Hurd <shurd@FreeBSD.org>

Improve netmap TX handling when TX IRQs are not used/supported

Use the timer to poll for TX completions when there are
outstanding TX slots. Track when the last driver timer was called
to prevent overcalling it. Also clean up some kring vs NIC ring
usage.

Reviewed by: marius, Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16300


# 7f87c040 15-Jul-2018 Marius Strobl <marius@FreeBSD.org>

Assorted TSO fixes for em(4)/iflib(9) and dead code removal:
- Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
was committed in r295133, CSUM_TSO always got disabled unconditionally
by em(4) on the first invocation of em_init_locked(). However, even with
that problem fixed, it turned out that for at least e. g. 82579 not all
necessary TSO workarounds are in place, still causing MAC hangs even at
Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
in r323292 (r323293 for stable/10) for the EM-class by default, allowing
users to turn it on if it happens to work with their particular EM MAC
in a Gigabit-only environment.
In head, the TSO workaround for speeds other than Gigabit was lost with
the conversion to iflib(9) in r311849 (possibly along with another one
or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
got enabled by default again, causing device hangs. Therefore, change the
default for this hardware class back to have TSO4 off, allowing users
to turn it on manually if it happens to work in their environment as
we do in stable/{10,11}. An alternative would be to add a whitelist of
EM-class devices where TSO4 actually is reliable with the workarounds in
place, but given that the advantage of TSO at Gigabit speed is rather
limited - especially with the overhead of these workarounds -, that's
really not worth it. [1]
This change includes the addition of an isc_capabilities to struct
if_softc_ctx so iflib(9) can also handle interface capabilities that
shouldn't be enabled by default which is used to handle the default-off
capabilities of e1000 as suggested by shurd@ and moving their handling
from em_setup_interface() to em_if_attach_pre() accordingly.
- Although 82543 support TSO4 in theory, the former lem(4) didn't have
support for TSO4, presumably because TSO4 is even more broken in the
LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
devices was enabled as part of the conversion to iflib(9) in r311849,
causing device hangs. So revert back to the pre-r311849 behavior of
not supporting TSO4 for LEM-class at all, which includes not creating
a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
- In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
(65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
DMA must have a maxsize of the maximum TSO size plus the size of a
VLAN header for software VLAN tagging. The iflib(9) converted em(4),
thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
in em_setup_interface() (apparently, left-over from pre-iflib(9)
times). So remove the later and correct iflib(9) to correctly cap
the maximum TSO size reported to the stack at IP_MAXPACKET. While at
it, let iflib(9) use if_sethwtsomax*().
This change includes the addition of isc_tso_max{seg,}size DMA engine
constraints for the TSO DMA tag to struct if_shared_ctx and letting
iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
IFCAP_VLAN_MTU is supported as requested by shurd@.
- Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
As a consequence, this adjustment now is also done in case of bnxt(4)
which missed it previously.
- Move the reduction of the maximum TSO segment count reported to the
stack by the number of m_pullup(9) calls (which in the worst case,
can add another mbuf and, thus, the requirement for another DMA
segment each) in the transmit path for performance reasons from
em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
done in iflib_parse_header() rather than in the no longer existing
em_xmit(). Moreover, this optimization applies to all drivers using
iflib(9) and not just em(4); all in-tree iflib(9) consumers still
have enough room to handle full size TSO packets. Also, reduce the
adjustment to the maximum number of m_pullup(9)'s now performed in
iflib_parse_header().
- Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
in r311849 and r335338 respectively, these drivers didn't enable
IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
by default but also lagg(4) was fixed in that regard in r203548. So
just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
in {em,ixl,ixlv}_setup_interface().
- Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
now.
- Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
em_if_attach_pre().
- Remove some IFCAP_* duplicated either directly or indirectly (e. g.
via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS.
- Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
iflib(9) adds that capability unconditionally.
- Remove some unused macros from em(4).
- Bump __FreeBSD_version as some of the above changes require the modules
of drivers using iflib(9) to be recompiled.

Okayed by: sbruno@ at 201806 DevSummit Transport Working Group [1]
Reviewed by: sbruno (earlier version), erj
PR: 219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
Differential Revision: https://reviews.freebsd.org/D15720


# dfae03b5 18-Jun-2018 Eric Joyner <erj@FreeBSD.org>

iflib: Style fixes

MFC after: 1 week


# e4defe55 17-Jun-2018 Marius Strobl <marius@FreeBSD.org>

Assorted fixes to MSI-X/MSI/INTx setup in iflib(9):
- In iflib_msix_init(), VMMs with broken MSI-X activation are trying
to be worked around by manually enabling PCIM_MSIXCTRL_MSIX_ENABLE
before calling pci_alloc_msix(9). Apart from constituting a layering
violation, this has the problem of leaving PCIM_MSIXCTRL_MSIX_ENABLE
enabled when falling back to MSI or INTx when e. g. MSI-X is black-
listed and initially also when disabled via hw.pci.enable_msix. The
later in turn was incorrectly worked around in r325166.
Since r310806, pci(4) itself has code to deal with broken MSI-X
handling of VMMs, so all of these workarounds in iflib(9) can go,
fixing non-working interrupts when falling back to MSI/INTx. In
any case, possibly further adjustments to broken MSI-X activation
of VMMs like enabling r310806 by default in VM environments need to
be placed into pci(4), not iflib(9). [1]
- Also remove the pci_enable_busmaster(9) call from iflib_msix_init(),
which is already more properly invoked from iflib_device_attach().
- When falling back to MSI/INTx, release the MSI-X BAR resource again.
- When falling back to INTx, ensure scctx->isc_vectors is set to 1 and
not to something higher from a device with more than one MSI message
supported.
- Make the nearby ring_state(s) stuff (static) const.

Discussed with: jhb at BSDCan 2018 [1]
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D15729


# 3ab4a960 08-Jun-2018 Stephen Hurd <shurd@FreeBSD.org>

Remove tx task spinning added in r333686

This caused issues with PASTE. Just remove the reschedule since the DELAY()
should be enough for use cases such as pkt-gen which were failing before the
change.

Reported by: Michio Honda
Sponsored by: Limelight Networks


# a06424dd 07-Jun-2018 Eric Joyner <erj@FreeBSD.org>

iflib: Record TCP checksum info in iflib when TCP checksum is requested

ixl(4) (when it switches over to using iflib) devices need the TCP header
length in order to do TCP checksum offload.

Reviewed by: gallatin@, shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15558


# 3e0e6330 29-May-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: mark irq allocation name parameter as constant

The *name parameter passed to iflib_irq_alloc_generic and
iflib_softirq_alloc_generic is never modified. Many places in code pass
string literals and thus should not be modified.

Mark the *name parameter as a const char * instead, so that we enforce
that the name is not modified before passing to bus_describe_intr()

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: kmacy
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15343


# 6c3c3194 29-May-2018 Matt Macy <mmacy@FreeBSD.org>

iflib: hold context lock across detach for drivers that need it


# 1d7ef186 25-May-2018 Eric Joyner <erj@FreeBSD.org>

iflib: Add new shared flag: IFLIB_ADMIN_ALWAYS_RUN

ixl(4)'s nvmupdate utility expects the nvmupdate process to run
while the interface is down; these nvm update commands use the
admin queue, so the admin queue needs to be able to generate
interrupts and be processed while the interface is down.

So add a flag that ixl(4) sets that lets the entire admin task
run even when the interface is marked down/IFF_DRV_RUNNING isn't set.

With this change, nvmupdate should function like it did pre-iflib.

Reviewed by: gallatin@, sbruno@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15575


# f6cb0dea 19-May-2018 Matt Macy <mmacy@FreeBSD.org>

net: fix uninitialized variable warning


# 46d0f824 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

net: fix set but not used


# 2aa6f526 16-May-2018 Matt Macy <mmacy@FreeBSD.org>

Fix !netmap build post r333686

Approved by: sbruno


# 5ee36c68 16-May-2018 Stephen Hurd <shurd@FreeBSD.org>

Work around lack of TX IRQs in iflib for netmap

When poll() is called via netmap, txsync is initially called,
and if there are no available buffers to reclaim, it waits for the driver
to notify of new buffers. Since the TX IRQ is generally not used in iflib
drivers, this ends up causing a timeout.

Work around this by having the reclaim DELAY(1) if it's initially unable
to reclaim anything, then schedule the tx task, which will spin by
continuously rescheduling itself until some buffers are reclaimed. In
general, the delay is enough to allow some buffers to be reclaimed, so
spinning is minimized.

Reported by: Johannes Lundberg <johalun0@gmail.com>
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15455


# 09f6ff4f 11-May-2018 Matt Macy <mmacy@FreeBSD.org>

iflib(9): Add support for cloning pseudo interfaces

Part 3 of many ...
The VPC framework relies heavily on cloning pseudo interfaces
(vmnics, vpc switch, vcpswitch port, hostif, vxlan if, etc).

This pulls in that piece. Some ancillary changes get pulled
in as a side effect.

Reviewed by: shurd@
Approved by: sbruno@
Sponsored by: Joyent, Inc.
Differential Revision: https://reviews.freebsd.org/D15347


# ac88e6da 08-May-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: print message when iflib_tx_structures_setup fails

Print a message when iflib_tx_structures_setup fails, like we do for
iflib_rx_structures_setup.

Now that we always print a message from within
iflib_qset_structures_setup when it fails, stop printing one in
iflib_device_register() at the call site.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15300


# 6108c013 08-May-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: cleanup queues when iflib_device_register fail

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15299


# 1f7ce05d 07-May-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Fix an off-by-one error when deciding to request a tx interrupt

The canonical check for whether or not a ring is drainable is
TXQ_AVAIL() > MAX_TX_DESC() + 2. Use this same construct here,
in order to avoid a potential off-by-one error where we might otherwise
fail to request an interrupt.

Reviewed by: mmacy
Sponsored by: Netflix


# 94618825 05-May-2018 Mark Johnston <markj@FreeBSD.org>

Add netdump support to iflib.

em(4) and igb(4) were tested by me, and ixgbe(4) and bnxt(4) were
tested by sbruno.

Reviewed by: mmacy, shurd
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15262


# 1ae4848c 04-May-2018 Matt Macy <mmacy@FreeBSD.org>

fix gcc8 warnings

Approved by: sbruno


# b89827a0 04-May-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: fix invalid free during queue allocation failure

In r301567, code was added to cleanup to prevent memory leaks for the
Tx and Rx ring structs. This code carefully tracked txq and rxq, and
made sure to free them properly during cleanup.

Because we assigned the txq and rxq pointers into the ctx->ifc_txqs and
ctx->ifc_rxqs, we carefully reset these pointers to NULL, so that
cleanup code would not accidentally free the memory twice.

This was changed by r304021 ("Update iflib to support more NIC designs"),
which removed this resetting of the pointers to NULL, because it re-used
the txq and rxq pointers as an index into the queue set array.

Unfortunately, the cleanup code was left alone. Thus, if we fail to
allocate DMA or fail to configure the queues using the drivers ifdi
methods, we will attempt to free txq and rxq. These variables would now
incorrectly point to the wrong location, resulting in a page fault.

There are a number of methods to correct this, but ultimately the root
cause was that we reuse the txq and rxq pointers for two different
purposes.

Instead, when allocating, store the returned pointer directly into
ctx->ifc_txqs and ctx->ifc_rxqs. Then, assign this to txq and rxq as
index pointers before starting the loop to allocate each queue.
Drop the cleanup code for txq and rxq, and only use ctx->ifc_txqs and
ctx->ifc_rxqs.

Thus, we no longer need to free txq or rxq under any error flow, and
intsead rely solely on the pointers stored in ctx->ifc_txqs and
ctx->ifc_rxqs. This prevents the invalid free(), and ensures that we
still properly cleanup after ourselves as before when failing to
allocate.

Submitted by: Jacob Keller
Reviewed by: gallatin, sbruno
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15285


# 4d613f5d 04-May-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: remove unused brscp pointer from iflib_queues_alloc

This pointer was no longer written to as of r315217. Since nothing writes
to the variable, remove it.

Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin, kmacy, sbruno
Differential Revision: https://reviews.freebsd.org/D15284


# aa8a24d3 03-May-2018 Stephen Hurd <shurd@FreeBSD.org>

Allow iflib NIC drivers to sleep rather than busy wait

Since the move to SMP NIC driver locking has had to go through serious
contortions using mtx around long running hardware operations. This moves
iflib past that.

Individual drivers may now sleep when appropriate.

Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14983


# f7594707 30-Apr-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Fix iflib_encap() EFBIG handling bugs

1) Don't give up if m_collapse() fails. Rather than giving up, try
m_defrag() immediately.

2) Fix a leak where, if the NIC driver rejected the defrag'ed chain
as having too many segments, we would fail to free the chain.

Reviewed by: Matthew Macy <mmacy@mattmacy.io> (this version of patch)
Submitted by: Matthew Macy <mmacy@mattmacy.io> (early version of leak fix)


# 0b75ac77 18-Apr-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: Fix queue distribution when there are no threads

Previously, if there are no threads, all queues which targeted
cores that share an L2 cache were bound to a single core. The intent is
to distribute them across these cores.

Reported by: olivier
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15120


# 7b610b60 12-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Restore r332389 after resolution of locking fixes.

Add one extra lock initialization to iflib_register() that was missed
in the git<->phab conversion.

Split out flag manipulation from general context manipulation in iflib

To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.

Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14967


# 2ff91c17 12-Apr-2018 Vincenzo Maffione <vmaffione@FreeBSD.org>

netmap: align codebase to the current upstream (commit id 3fb001303718146)

Changelist:
- Turn tx_rings and rx_rings arrays into arrays of pointers to kring
structs. This patch includes fixes for ixv, ixl, ix, re, cxgbe, iflib,
vtnet and ptnet drivers to cope with the change.
- Generalize the nm_config() callback to accept a struct containing many
parameters.
- Introduce NKR_FAKERING to support buffers sharing (used for netmap
pipes)
- Improved API for external VALE modules.
- Various bug fixes and improvements to the netmap memory allocator,
including support for externally (userspace) allocated memory.
- Refactoring of netmap pipes: now linked rings share the same netmap
buffers, with a separate set of kring pointers (rhead, rcur, rtail).
Buffer swapping does not need to happen anymore.
- Large refactoring of the control API towards an extensible solution;
the goal is to allow the addition of more commands and extension of
existing ones (with new options) without the need of hacks or the
risk of running out of configuration space.
A new NIOCCTRL ioctl has been added to handle all the requests of the
new control API, which cover all the functionalities so far supported.
The netmap API bumps from 11 to 12 with this patch. Full backward
compatibility is provided for the old control command (NIOCREGIF), by
means of a new netmap_legacy module. Many parts of the old netmap.h
header has now been moved to netmap_legacy.h (included by netmap.h).

Approved by: hrs (mentor)


# 66def526 11-Apr-2018 Mateusz Guzik <mjg@FreeBSD.org>

iflib: fix up a mismerge in r332419

Lead to crashes on boot while in ifconfig.

Submitted by: Matthew Macy <mmacy@mattmacy.io>


# 90d72813 11-Apr-2018 Stephen Hurd <shurd@FreeBSD.org>

Properly initialize ifc_nhwtxqs.

Also, since ifc_nhwrxqs is only used in one place, remove it from the struct.
This was preventing iflib_dma_free() from being called via
iflib_device_detach().

Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks


# 7feb8819 11-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r332389 as it is causing panics for various users and we need
to add some more test cases.


# 5c1d8c4b 10-Apr-2018 Stephen Hurd <shurd@FreeBSD.org>

Split out flag manipulation from general context manipulation in iflib

To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.

Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14967


# 541d96aa 30-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Use an accessor function to access ifr_data.

This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900


# 18628b74 25-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Clamp IFLIB_RX_COPY_THRESH to MHLEN in iflib_rxd_pkt_get().

If one has added fields to struct mbuf such that MHLEN is smaller than
this threshold (128), iflib_rxd_pkt_get() may otherwise overrun the
internal mbuf buffer while copying.

Reviewed by: mmacy
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14843


# 226fb85d 02-Mar-2018 Stephen Hurd <shurd@FreeBSD.org>

iflib: stop timer callout when stopping

iflib_timer has been seen running after the interface had been removed.
This change prevents that.

Submitted by: matt.macy@joyent.com


# 7cb7c6e3 20-Feb-2018 Navdeep Parhar <np@FreeBSD.org>

Catch up with the removal of nktr_slot_flags from upstream netmap. No
functional impact intended.

Submitted by: Vincenzo Maffione <v.maffione@gmail.com>


# a4e59607 20-Feb-2018 Stephen Hurd <shurd@FreeBSD.org>

IFLIB: do not remove dmamap on buffer unload

Dmamap is created only on IFC attach. If we remove it on
buffer release, we won't be able to do ifconfig down&up. Only destroy
when in detach.

Reported by: wma
Reviewed by: wma
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14060


# ac2fffa4 21-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by: wosch
PR: 225197


# 44313341 15-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

net*: make some use of mallocarray(9).

Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837


# 9c58cafa 27-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Don't pass rids to taskqgroup_attach()

As everywhere else, we want to pass rman_get_start(irq->ii_res). This
caused set affinity errors when not using MSI-X vectors (legacy and MSI
interrupts).

Reported by: sbruno
Sponsored by: Limelight Networks


# ca03863c 27-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Remove assertion that's not true for !EARLY_AP_STARTUP

gtask->gt_taskqueue is NULL when EARLY_AP_STARTUP is not enabled.
Remove assertion to allow this config to work.

Reported by: oleg
Sponsored by: Limelight Networks


# de130954 27-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix indentation.

Sponsored by: Limelight Networks


# 97755e83 21-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Fix build for kernels with SCHED_4BSD.

Sponsored by: The FreeBSD Foundation


# 25ac1dd5 20-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Don't call tcp_lro_rx() unless hardware verified TCP/UDP csum

It seems that tcp_lro_rx() doesn't verify TCP checksums, so
if there are bad checksums in the packets caused by invalid data, the
invalid data will pass through without errors.

This was noticed with the igb driver and a specific internet host:
fetch http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.xz -o test.bin && sha256 test.bin
Would result in a different value sometimes.

This ends up making LRO require RXCSUM to be enabled, and RXCSUM to
support TCP and UDP checksums.

PR: 224346
Reported by: gjb
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13561


# 40cf51c4 19-Dec-2017 Li-Wen Hsu <lwhsu@FreeBSD.org>

Add missing `;`

Approved by: kevlo


# b103855e 19-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Support attaching tx queues to cpus

This will attempt to use a different thread/core on the same L2
cache when possible, or use the same cpu as the rx thread when not.
If SMP isn't enabled, don't go looking for cores to use. This is mostly
useful when using shared TX/RX queues.

Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12446


# 96fc97c8 19-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Update Matthew Macy contact info

Email address has changed, uses consistent name (Matthew, not Matt)

Reported by: Matthew Macy <mmacy@mattmacy.io>
Differential Revision: https://reviews.freebsd.org/D13537


# 06c47d48 11-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Increment encap_pad_mbuf_fail when m_dup() fails in padding

Previously, the counter was only incremented when m_append() failed. Since
the function can also fail on m_dup() now, increment the counter there as
well.

Sponsored by: Limelight Networks


# 04993890 08-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Free mbuf chain when m_dup fails

Fix memory leak where mbuf chain wasn't free()d if iflib_ether_pad()
has a failure in m_dup().

Reported by: "Ryan Stone" <rysto32@gmail.com>
Sponsored by: Limelight Networks


# a15fbbb8 08-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Handle read-only mbufs in iflib ether pad function

If ethernet padding is enabled, and a read-only mbuf is passed,
it would modify the mbuf using m_append(). Instead, call m_dup() and
append to the new packet.

Reported by: Pyun YongHyeon
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13414


# d14c853b 05-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

iflib: Support to padding Ethernet frames to a min size

Some bnxt devices do not correctly send frames smaller than
52 bytes (without CRC), so add a quirk that will pad frames to an
arbitrary size before passing off to the encap routine.

Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13269


# fe1bcada 05-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Avoid calling CURVNET_[SET|RESTORE] for each packet

The LRO possible test was calling CURVNET_SET once for IPv4 or IPv6 for
each packet in a chain. Only call it once per chain instead.

Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: cem, ae
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13368


# a027c8e9 01-Dec-2017 Stephen Hurd <shurd@FreeBSD.org>

Add support for SIOCGIFXMEDIA to iflib

SIOCGIFXMEDIA is required for extended ethernet media types,
but iflib did not support it.

Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13312


# 772593db 29-Nov-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix comment introduced in r326369

The code uses the set of all CPUs, it doesn't zero out the set.

Sponsored by: Limelight Networks


# e516b535 29-Nov-2017 Stephen Hurd <shurd@FreeBSD.org>

Ensure that ctx->ifc_cpus is always initialized

If a device didn't support MSI-X, ctx->ifc_cpus would not be initialized,
but the IRQ allocation routines still uses the value. Move the
initialization to common code.

Sponsored by: Limelight Networks


# 7274b2f6 20-Nov-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix off-by-one error in bit_nclear() usage

bit_nclear() takes the bit numbers for the start and end bits, not the start
and a count. This was resulting in memory corruption past the end of the
bitstr_t.

Sponsored by: Limelight Networks


# d2735264 16-Nov-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix default numbers of iflib queue sets

The intent appears to be having one RX/TX queue set per core,
but since scctx->isc_n[tr]xqsets is set to max before calling
iflib_msix_init(), both end up being set to total number of cores.

Use ctx->ifc_sysctl_n[rt]xqs as the selected value and
scctx->isc_n[rt]xqsets as the max. This should result in what appears
to be the intended behaviour

Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13096


# abec4724 06-Nov-2017 Sean Bruno <sbruno@FreeBSD.org>

Fix NOINET/NOINET6 build during compilation of iflib.

Reported by: kib


# 35e4e998 06-Nov-2017 Stephen Hurd <shurd@FreeBSD.org>

Only chain non-LRO mbufs when LRO is not possible

Preserve packet order between tcp_lro_rx() and if_input() to avoid
creating extra corner cases. If no packets can be LROed, combine them
into one chain for submission via if_input(). If any packet can
potentially be LROed however, retain old behaviour and call if_input()
for each packet.

This should keep the 12% improvement for small packet forwarding intact,
but mostly avoids impacting the LRO case.

Reviewed by: cem, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12876


# 0b6c52b6 31-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Preserve TSO checksum flags

r323941 incorrectly disabled TSO flags based on MTU.

Reported by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12880


# a1b799ca 31-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix PR221990 - Assertion at iflib.c:1947

ifl_pidx and ifl_credits are going out of sync in _iflib_fl_refill() as they
use different update log. Use the same update logic for both, and add a
final call to isc_rxd_refill() to handle early exits from the loop.

PR: 221990
Reported by: pho
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12798


# 10e0d938 30-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix build with nodevice netmap

iru_init() was declared and used outside the DEV_NETMAP
conditional blocks, but was implemented inside one. Move the
implementation out of the DEV_NETMAP block to allow building with
netmap disabled.

Reported by: Andrew Turner <andrew@fubar.geek.nz>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12842


# 09b57b7f 30-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

bnxt: HW_LRO Rx Pkt with > 32 fragments caused Crash (iflib)

Broadcom NIC with HW_LRO setting max_agg_segs >= 6 can generate Rx pkt with
64 (2^6) fragments, modify IFLIB_MAX_RX_SEGS to 64 to avoid memory
corruption / Crash.

Submitted by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Reviewed by: shurd, sbruno
Approved by: sbruno (mentor)
Sponsored by: Broadcom Limited
Differential Revision: https://reviews.freebsd.org/D12774


# 2d873474 30-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix PR222744 - netmap errors with iflib em driver

Fix error when refilling netmap buffers that resulted in the first
buffer of the successive passes through ifl_bus_addrs[] leaving the
first value unset (tmp_pidx started at 1, not zero after the first time
through the loop).

Leave the one unused buffer required by some NICs visible in the netmap
ring rather than hidden. There will always be a buffer in use by the
kernel now when an iflib driver is used via netmap.

Always get the netmap slot index via netmap_idx_n2k() to account for
nkr_hwofs in a consistent way.

Split shared functionality into new functions.
iru_init(): shared by _iflib_fl_refill() and netmap_fl_refill()
netmap_fl_refill(): shared by iflib_netmap_rxsync() and
iflib_netmap_rxq_init()

PR: 222744
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12769


# 0fdea539 30-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Avoid enabling MSI-X if MSI-X is disabled globally

It was reported on the community call that with
hw.pci.enable_msix=0, iflib would enable MSI-X on the device and attempt
to use it, which caused issues. Test the sysctl explicitly and do not
enable MSI-X if it's disabled globally.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12805


# 3429c02f 23-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Some cache related optimizations

1. prefetch 128 bytes of mbufs.
2. Re-order filling the pkt_info so cache stalls happen at the end
3. Define empty prefetch2cachelines() macro when the function isn't present.

Provides small performance improvments on some hardware

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12447


# 1c0054d2 05-Oct-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers

Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12496


# 1225d9da 23-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Have ifmp_ring_enqueue() abdicate instead of switch to a consumer

Move TX out of the enqueue() path. As a result, we need
to have ifmp_ring_check_drainage() pick up from the abdicate state.

We also need to either enqueue the TX task, or check drainage
after calling ifmp_ring_enqueue() to ensure it's sent.

This change results in a 30% small packet forwarding improvement.

Reviewed by: olivier, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12439


# f4d2154e 22-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Make the rx budget a tunable

This allows tuning the rx budget for special load profiles
as well as more easily testing to determine sane defaults.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12445


# 20f63282 22-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Chain mbufs before passing to if_input()

Build a list of mbufs to pass to if_input() after LRO. Results in
12% small packet forwarding rate improvement.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12444


# c5cf2172 22-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Some small packet performance improvements

If the packet is smaller than MTU, disable the TSO flags.
Move TCP header parsing inside the IS_TSO?() test.
Add a new IFLIB_NEED_ZERO_CSUM flag to indicate the checksums need to be zeroed before TX.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12442


# d0d0ad0a 20-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Fix iflib netmap RX

RXQ setup for netmap was broken because netmap_rxq_init was getting called
before IFDI_INIT - thus we ended up with ring tail pointer being reset to zero.

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12140


# ab2e3f79 15-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Revert r323516 (iflib rollup)

This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.

I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.

Approved by: sbruno (mentor)
Sponsored by: Limelight Networks


# d300df01 12-Sep-2017 Stephen Hurd <shurd@FreeBSD.org>

Roll up iflib commits from github. This pulls in most of the work done
by Matt Macy as well as other changes which he has accepted via pull
request to his github repo at https://github.com/mattmacy/networking/

This should bring -CURRENT and the github repo into close enough sync to
allow small feature branches rather than a large chain of interdependant
patches being developed out of tree. The reset of the synchronization
should be able to be completed on github by splitting the remaining
changes that are not yet ready into short feature branches for later
review as smaller commits.

Here is a summary of changes included in this patch:

1) More checks when INVARIANTS are enabled for eariler problem
detection
2) Group Task Queue cleanups
- Fix use of duplicate shortdesc for gtaskqueue malloc type.
Some interfaces such as memguard(9) use the short description to
identify malloc types, so duplicates should be avoided.
3) Allow gtaskqueues to use ithreads in addition to taskqueues
- In some cases, this can improve performance
4) Better logging when taskqgroup_attach*() fails to set interrupt
affinity.
5) Do not start gtaskqueues until they're needed
6) Have mp_ring enqueue function enter the ABDICATED rather than BUSY
state. This moves the TX to the gtaskq and allows processing to
continue faster as well as make TX batching more likely.
7) Add an ift_txd_errata function to struct if_txrx. This allows
drivers to inspect/modify mbufs before transmission.
8) Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need
checksums zeroed for checksum offload to work. This avoids modifying
packet data in the TX path when possible.
9) Use ithreads for iflib I/O instead of taskqueues
10) Clean up ioctl and support async ioctl functions
11) Prefetch two cachlines from each mbuf instead of one up to 128B. We
often need to parse packet header info beyond 64B.
12) Fix potential memory corruption due to fence post error in
bit_nclear() usage.
13) Improved hang detection and handling
14) If the packet is smaller than MTU, disable the TSO flags.
This avoids extra packet parsing when not needed.
15) Move TCP header parsing inside the IS_TSO?() test.
This avoids extra packet parsing when not needed.
16) Pass chains of mbufs that are not consumed by lro to if_input()
rather call if_input() for each mbuf.
17) Re-arrange packet header loads to get as much work as possible done
before a cache stall.
18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/
IFDI_DETACH();
19) Attempt to distribute RX/TX tasks across cores more sensibly,
especially when RX and TX share an interrupt. RX will attempt to
take the first threads on a core, and TX will attempt to take
successive threads.
20) Allow iflib_softirq_alloc_generic() to request affinity to the same
cpus an interrupt has affinity with. This allows TX queues to
ensure they are serviced by the socket the device is on.
21) Add new iflib sysctls to net.iflib:
- timer_int - interval at which to run per-queue timers in ticks
- force_busdma
22) Add new per-device iflib sysctls to dev.X.Y.iflib
- rx_budget allows tuning the batch size on the RX path
- watchdog_events Count of watchdog events seen since load
23) Fix error where netmap_rxq_init() could get called before
IFDI_INIT()
24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY
when waiting for firmware
- After interrupts are enabled, convert all waits to sleeps
- Eliminates e1000 software/firmware synchronization busy waits after
startup
25) e1000: Remove special case for budget=1 in em_txrx.c
- Premature optimization which may actually be incorrect with
multi-segment packets
26) e1000: Split out TX interrupt rather than share an interrupt for
RX and TX.
- Allows better performance by keeping RX and TX paths separate
27) e1000: Separate igb from em code where suitable
Much easier to understand separate functions and "if (is_igb)" than
previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))"

#blamebruno

Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12235


# 2cc3b2ee 31-Aug-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Do not abuse flag that is clearly marked as unused.
This creates conflicts with FreeBSD variations that may use it. The
usage of the flag M_TOOBIG is limited to iflib queue, thus using
one of M_PROTO flags is fine. There is no need to grab global flag.

Silence from: kmacy, sbruno (2 weeks)


# a9693502 30-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r323008 and its conversion of e1000/iflib to using SX locks.

This seems to be missing something on the 82574L causing NFS root mounts
to hang.

Reported by: kib


# e17e5b41 29-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

Continuation of lock cleanup in e1000.

Post-cold sleep instead of DELAY when waiting for firmware.

Convert softc mutex to an SX lock. Change all waits to sleeps
once interrupts are enabled (and it is safe to sleep).

Submitted by: Matt Macy <matt@mattmacy.io>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12101


# 21e10b16 23-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

iflib: call device's if_init function during vlan initialization.

Submitted by: bhargava.marreddy@broadcom.com
Reviewed by: shurd
Sponsored by: Broadcom
Differential Revision: https://reviews.freebsd.org/D12098


# 5c5ca36c 09-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

Don't leak mbufs if clusers exceeds the number of segments. This would
leak mbufs over time causing crashes.

PR: 221202
Submitted by: Matt Macy <matt@mattmacy.io>
Reported by: gergely.czuczy@harmless.hu
Sponsored by: Limelight Networks


# 18a660b3 09-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

Export IFCAP_HWSTATS so that we don't experience double stats counting
on iflib enabled devices.

PR: 220198
Submitted by: Matt Macy <matt@mattmacy.io>
Reported by: Ben Woods <woodsb02@freebsd.org>
Sponsored by: Limelight Networks


# 9d35858f 27-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Slight restructure of iflib_busdma_load_mbuf_sg() to fix accounting
when m_collapse() fails.

Submitted by: krzystof.galazka@intel.com
Reviewed by: Jeb Cramer <cramerj@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D11476


# 9d0a88de 20-Jul-2017 Dimitry Andric <dim@FreeBSD.org>

Fix printf format warning in iflib.c

Clang 5.0.0 got better warnings about printf format strings using %zd,
and this leads to the following -Werror warning on e.g. arm:

sys/net/iflib.c:1517:8: error: format specifies type 'ssize_t' (aka 'int') but the argument has type 'bus_size_t' (aka 'unsigned long') [-Werror,-Wformat]
sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize);
^~~~~~~~~~~~~~~~~~~~
sys/net/iflib.c:1517:41: error: format specifies type 'ssize_t' (aka 'int') but the argument has type 'bus_size_t' (aka 'unsigned long') [-Werror,-Wformat]
sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize);
^~~~~~~~~~~~~~~~~~~~~~~

Fix this by casting bus_size_t arguments to uintmax_t, and using %ju
instead.

Reviewed by: emaste
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D11679


# dcbc025f 19-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Don't cache mbuf pointers if the number of descriptors is greater than
the number of buffers.

Submitted by: Matt Macy <mmacy@mattmacy.io>
Sponsored by: Limelight Networks


# 25d52811 03-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

iflib - flib_busdma_load_mbuf_sg used isc_tx_maxsize as max semgent size.

Submitted by: krzysztof.galazka@intel.com
Differential Revision: https://reviews.freebsd.org/D11403


# 87890dba 03-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

bnxt(4) Enable LRO support, redux

iflib - reset fl-ifl_fragidx to 0 on iflib_fl_bufs_free(). This caused the
panic in em/igb when adding it to a bridge device.

iflib - Handle out of order packet delivery from hardware in support of LRO

Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed. While refilling the buffers, iflib is not considering
the out of order descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections. Fix is to maintain a bitmap of
rx descriptors and differentiate the used one with unused one and
refill only at the unused indices. This patch also fixes a
few bugs in bnxt, related to the same feature.

Submitted by: bhargava.marreddy@broadcom.com
Reviewed by: venkatkumar.duvvuru@broadcom.com shurd
Differential Revision: https://reviews.freebsd.org/D10681


# fa5416a8 17-Jun-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r319989 "bnxt(4) Enable LRO support"

This generates startup LORs and panics when adding elements to bridge
devices. I will document further in https://reviews.freebsd.org/D10681

PR: 220073
Submitted by: dchagin
Reported by: db


# 51a621f7 15-Jun-2017 Sean Bruno <sbruno@FreeBSD.org>

bnxt(4) Enable LRO support

iflib - Handle out of order packet delivery from hardware in support of LRO

Out of order updates to rxd's is fixed in r315217. However, it is not
completely fixed. While refilling the buffers, iflib is not considering
the out of order descriptors. Hence, it is refilling sequentially.
"idx" variable in _iflib_fl_refill routine is incremented sequentially.
By doing refilling sequentially, it will override the SGEs that
are *IN USE* by other connections. Fix is to maintain a bitmap of
rx descriptors and differentiate the used one with unused one and
refill only at the unused indices. This patch also fixes a
few bugs in bnxt, related to the same feature.

Submitted by: bhargava.marreddy@broadcom.com
Reviewed by: shurd@
Differential Revision: https://reviews.freebsd.org/D10681


# 3600bd1e 15-Jun-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert r319921 which seems to cause NFS booting assertion panics in
various configurations.

Reported by: pho@


# f7587db0 13-Jun-2017 Sean Bruno <sbruno@FreeBSD.org>

Add new sysctl to allow changing of timing of the txq timers.

Add new sysctl to override use of busdma in the driver.

Submitted by: Drew Gallitin <gallatin@netflix.com>


# aa8fa07c 13-Jun-2017 Sean Bruno <sbruno@FreeBSD.org>

Plug mbuf leak in the busdma path of iflib.

Submitted by: Michael Tuexen <tuexen@freebsd.org>
Reported by: Drew Gallitin <gallatin@netflix.com>


# 1d898b91 09-May-2017 Bjoern A. Zeeb <bz@FreeBSD.org>

Adjust a comment.

MFC after: 3 days


# 60596476 06-Apr-2017 Sean Bruno <sbruno@FreeBSD.org>

Move pause frame counter out of struct if_ctx and into struct if_softc_ctx_t
so that we can use it in iflib to detect pause frames.

The igb(4) driver definitely used to use this in its old timer function and
I see no reason to restrict it to that driver only.

Sponsored by: Limelight Networks


# ea351d3f 04-Apr-2017 Sean Bruno <sbruno@FreeBSD.org>

Allow MSIX to be turned off by tuneable per interface, per driver.

Sponsored by: Limelight Networks


# 2b2fc973 30-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

Don't call init functions directly from the timer/watchdog function.

Enqueue this in the admin task now that it can process it.

Submitted by: Matt Macy <mmacy@nextbsd.org>
Sponsored by: Limelight Networks


# 5c1ff255 30-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

Assert IFF_DRV_OACTIVE in iflib_timer() when the "hung" case is detected
so that iflib's admin task can still process the reset directive and restore
functionality.

Sponsored by: Limelight Networks


# 5e888388 14-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

Change casting to a uintptr_t to be compatible with non-x86 architectures.

Submitted by: Matt Macy <mmacy@nextbsd.org>
Reported by: rpokala
Sponsored by: Limelight Networks


# 0a1b74a3 14-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

Fixup LINT by using uint64_t type as we do on all other calls to PNMB()

Found with Jenkins.

Reported by: lwshu
Sponsored by: Limelight Networks


# 95246abb 13-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

IFLIB updates
- unconditionally enable BUS_DMA on non-x86 architectures
- speed up rxd zeroing via customized function
- support out of order updates to rxd's
- add prefetching to hardware descriptor rings
- only prefetch on 10G or faster hardware
- add seperate tx queue intr function
- preliminary rework of NETMAP interfaces, WIP

Submitted by: Matt Macy <mmacy@nextbsd.org>
Sponsored by: Limelight Networks


# d945ed64 01-Mar-2017 Sean Bruno <sbruno@FreeBSD.org>

Make gtaskqueue compatible with drm-next such that they can be used with the
linuxkpi tasklets.

Submitted by: mmacy@nextbsd.org
Reported by: hps


# e099b90b 21-Feb-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: Replace zero with NULL for pointers.

Found with: devel/coccinelle
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D9694


# 67af525c 04-Feb-2017 Sean Bruno <sbruno@FreeBSD.org>

Delete duplicate break.


# 835809f9 28-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Fix i386 compile failure by moving needed closing parenthesis out of
conditional block.

Submitted by: hiren
Reported by: cy


# e035717e 27-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

IFLIB updates:
We found routing performance dropped significantly when configuring
FreeBSD as a router, we are applying the following changes in order to
resolve those issues and hopefully perform better.
- don't prefetch the flags array, we usually don't need it
- prefetch the next cache line of each of the software descriptor arrays as
well as the first cache line of each of the next four packets' mbufs and
clusters
- reduce max copy size to 63 bytes
- convert rx soft descriptors from array of structures to a structure of arrays
- update copyrights

Submitted by: Matt Macy <mmacy@nextbsd.org>


# 96eeabef 27-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Replace customized busmaster code with standardized setup call.

Reported by: jhb


# 69b7fc3e 26-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Minor style annoyance.

Submitted by: bde


# f7ae9a84 25-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Add error checking to the pci_find_cap(, PCIY_MSIX,) call that is returns
success and a good value. Only then try to use it and set the MSIX_ENABLE
bit.

With the current em(4) driver we have observed failures in this case in a
specific environment when pci_find_cap() would not return the assumed
value, which meant we ended up writing to PCI register 2 (PCI_DEVICE_ID)
which is read-only.

PR: 216456
Submitted by: bz


# bd84f700 24-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

iflib:
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.

e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.

Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>


# 36fa5d5b 24-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Revert 312696 due to build tests.


# 562a3182 24-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

iflib:
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.

e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.

Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>


# 4ecb427a 14-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Fix hangs in a uniprocessor configuration (qemu, virtualbox, real hw).

sys/net/iflib.c:
Add ctx to filter_info and don't skpi interrupt early on unless we're on an
SMP system

sys/kern/subr_gtaskqueue.c:
Skip smp check if we're running UP

Submitted by: Matt Macy <mmacy@nextbsd.org>
Reported by: emaste bde


# 5b51fcfc 09-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

Remove unused mtx_held() macro.


# 1248952a 01-Jan-2017 Sean Bruno <sbruno@FreeBSD.org>

2017 IFLIB updates in preparation for commits to e1000 and ixgbe.
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)

Some other various, sundry updates. Slightly more verbose changelog:

Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
mFC after:
Sponsored by: LimeLight Networks and Dell EMC Isilon


# da69b8f9 17-Nov-2016 Sean Bruno <sbruno@FreeBSD.org>

iflib updates and fixes:
- reset gen on down
- initialize admin task statically
- drain mp_ring on down
- don't drop context lock on stop
- reset error stats on down
- fix typo in min_latency sysctl
- return ENOBUFS from if_transmit if the driver isn't running or the link is down

Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
MFC after: 2 days
Sponsored by: Isilon and Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8558


# 2fe66646 18-Oct-2016 Sean Bruno <sbruno@FreeBSD.org>

Set default capabilities at attach.

ref: https://github.com/NextBSD/NextBSD/commit/6425f45e5fc89f64925995bbcfc09c7558d896ea

Submitted by: mmacy@nextbsd.org


# add6f7d0 18-Oct-2016 Sean Bruno <sbruno@FreeBSD.org>

When deciding whether or not to call tqg_attach_cpu(), reference rid
directly.

ref: https://github.com/NextBSD/NextBSD/commit/c9b47b468b8a3350811acfd9e167a8b91dc8f0c6

Submitted by: mmacy@nextbsd.org


# 8b2a1db9 18-Oct-2016 Sean Bruno <sbruno@FreeBSD.org>

Toggle v4/v6 rxcsum together

Only re-init if driver is running

ref: https://github.com/NextBSD/NextBSD/commit/106518e874ec9a61daf4c09894170d24e2f4d60d

Submitted by: mmacy@nextbsd.org


# aa3c5dd8 18-Oct-2016 Sean Bruno <sbruno@FreeBSD.org>

Fix misusage of CPU_FFS when binding queues to cpus

ref: https://github.com/NextBSD/NextBSD/commit/922d0bdf2277f30954f143107d2a3eddb02abd2d

Submitted by: mmacy@nextbsd.org


# 23ac9029 12-Aug-2016 Stephen Hurd <shurd@FreeBSD.org>

Update iflib to support more NIC designs

- Move group task queue into kern/subr_gtaskqueue.c
- Change intr_enable to return an int so it can be detected if it's not
implemented
- Allow different TX/RX queues per set to be different sizes
- Don't split up TX mbufs before transmit
- Allow a completion queue for TX as well as RX
- Pass the RX budget to isc_rxd_available() to allow an earlier return
and avoid multiple calls

Submitted by: shurd
Reviewed by: gallatin
Approved by: scottl
Differential Revision: https://reviews.freebsd.org/D7393


# f454e7eb 04-Aug-2016 John Baldwin <jhb@FreeBSD.org>

Add __printflike() to bus_describe_intr() to enable -Wformat checks.

Fix a few places that were passing a raw string as the format to use
a "%s" format string instead.

MFC after: 2 months


# 91d546a0 08-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

iflib: Fix typo in 'iflib_rx_miss_bufs' sysctl name

It looks like these sysctls were copy-pasted from netmap. Most were changed
from 'ixl_' prefix to 'iflib_', but this one was missed.

Fix the "can't re-use a leaf (ixl_rx_miss_bufs)!" warning.

Reported by: dim@ and others
Sponsored by: EMC / Isilon Storage Division


# 96c85efb 06-Jul-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)


# 0d0338af 07-Jun-2016 Conrad Meyer <cem@FreeBSD.org>

iflib: Improve cleanup on iflib_queues_alloc error path

Fix some memory leaks. Some may remain.

Reported by: Coverity
Discussed with: mmacy
CIDs: 1356036, 1356037, 1356038
Sponsored by: EMC / Isilon Storage Division


# 16fb86ab 07-Jun-2016 Conrad Meyer <cem@FreeBSD.org>

iflib: Fix potential leak in iflib_if_transmit

Due to an accidental mismatch between allocation and release in the slow path
of iflib_if_transmit, if a caller passed 9-16 mbufs to the routine, the mbuf
array would be leaked.

Fix the mismatch by removing the magic numbers in favor of nitems() on the
stack array. According to mmacy, this leak is unlikely.

Reported by: Coverity
Discussed with: mmacy
CID: 1356040
Sponsored by: EMC / Isilon Storage Division


# c7762913 18-May-2016 Scott Long <scottl@FreeBSD.org>

Remove assertions that don't make sense for the data type.


# aaeb188a 18-May-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Make compile without INET or without IP support in the kernel by hiding
variables and lro function calls behind approriate #ifdefs.

Also move the #includes for "opt_*" to the place where they should be.


# 4c7070db 17-May-2016 Scott Long <scottl@FreeBSD.org>

Import the 'iflib' API library for network drivers. From the author:

"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."

Participation is purely optional. The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation. ixl and ixgbe driver conversions will be committed
shortly. We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.

Submitted by: mmacy@nextbsd.org
Reviewed by: gallatin
Differential Revision: D5211