History log of /freebsd-current/sys/net/netisr.c
Revision Date Author Comments
# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# 028ecc7a 03-Sep-2022 Gordon Bergling <gbe@FreeBSD.org>

netisr(9): Fix a typo in a source code comment

- s/overriden/overridden/

MFC after: 3 days


# 51f798e7 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netisr: serialize/restore m_pkthdr.rcvif when queueing mbufs

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33268

(cherry picked from commit 6871de9363e559fef6765f0e49acc47f77544999)


# 0fa56369 03-May-2022 Marko Zec <zec@FreeBSD.org>

Revert "netisr: serialize/restore m_pkthdr.rcvif when queueing mbufs"

This reverts commit 6871de9363e559fef6765f0e49acc47f77544999.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex


# 6871de93 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netisr: serialize/restore m_pkthdr.rcvif when queueing mbufs

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33268


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 9033ad5f 16-May-2020 Pawel Biernacki <kaktus@FreeBSD.org>

sysctl: fix setting net.isr.dispatch during early boot

Fix another collateral damage of r357614: netisr is initialised way before
malloc() is available hence it can't use sysctl_handle_string() that
allocates temporary buffer. Handle that internally in
sysctl_netisr_dispatch_policy().

PR: 246114
Reported by: delphij
Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D24858
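The fix described in this entry can be illustrated with a minimal sketch, assuming the handler copies the requested string into a fixed-size stack buffer instead of a heap-allocated temporary. The names `sketch_parse_dispatch`, `SKETCH_POLICY_MAXLEN`, and the table below are hypothetical and are not taken from netisr.c; the ordering of the policy names in the table is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bound: large enough for the longest name plus a NUL. */
#define SKETCH_POLICY_MAXLEN 16

/* Hypothetical table; the index is used as the policy value. */
static const char *const sketch_policy_names[] = {
	"default", "deferred", "hybrid", "direct"
};

/*
 * Parse a policy name of reqlen bytes (not necessarily NUL-terminated)
 * without calling malloc(): copy it into a stack buffer, then compare.
 * Returns 0 and stores the table index in *policy, or EINVAL.
 */
static int
sketch_parse_dispatch(const char *req, size_t reqlen, unsigned *policy)
{
	char tmp[SKETCH_POLICY_MAXLEN];
	size_t i, n;

	assert(req != NULL && policy != NULL);
	if (reqlen >= sizeof(tmp))
		return (EINVAL);	/* too long to be a valid name */
	memcpy(tmp, req, reqlen);
	tmp[reqlen] = '\0';

	n = sizeof(sketch_policy_names) / sizeof(sketch_policy_names[0]);
	for (i = 0; i < n; i++) {
		if (strcmp(tmp, sketch_policy_names[i]) == 0) {
			*policy = (unsigned)i;
			return (0);
		}
	}
	return (EINVAL);
}
```

Because the buffer lives on the stack, a function shaped like this can run before the allocator is initialised, which is the constraint the commit message describes.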


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# bacb11c9 17-Feb-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix kernel panic while trying to read multicast stream.

When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set
for all mbufs being input by the IGMP/MLD6 code. Else there will be a
NULL-pointer dereference in the netisr code when trying to set the
VNET based on the incoming mbuf. Add an assert to catch this when
queueing mbufs on a netisr to make debugging of similar cases easier.

Found by: Vladislav V. Prodan
PR: 244002
Reviewed by: bz@
MFC after: 1 week
Sponsored by: Mellanox Technologies


# 977b9472 31-Jan-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert r357293.
The netisr uses rm_ locks not rms_ locks as noted by jeff@ .

Sponsored by: Mellanox Technologies


# 780c568f 29-Jan-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Widen EPOCH(9) usage in netisr.

Software interrupt handlers are allowed to sleep. In swi_net() there
is a read lock behind NETISR_RLOCK() which in turn ends up calling
msleep() which means the whole of swi_net() cannot be protected by an
EPOCH(9) section. By default the NETISR_LOCKING feature is disabled.

This issue was introduced by r357004. This is a preparation step for
replacing the functionality provided by r357004.

Found by: kib@
Sponsored by: Mellanox Technologies


# 6ed3e187 22-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Mark swi_net() as INTR_TYPE_NET and stop entering epoch there.


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# fb3bc596 24-May-2019 John Baldwin <jhb@FreeBSD.org>

Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.

Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.

To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.

For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.

- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.

This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).

In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.

To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.

- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
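The send tag life cycle described in this entry can be sketched in userland as follows. This is a simplified illustration under stated assumptions, not the kernel code: the `sketch_` types and functions are hypothetical stand-ins for m_snd_tag_init(), m_snd_tag_ref(), and m_snd_tag_rele(), and a plain counter is used where a kernel implementation would need atomic operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an ifnet and a send tag. */
struct sketch_ifp {
	unsigned refs;
};

struct sketch_tag {
	unsigned refs;
	struct sketch_ifp *ifp;
};

/* Allocate a tag; the tag itself takes the reference on the ifp. */
static struct sketch_tag *
sketch_tag_alloc(struct sketch_ifp *ifp)
{
	struct sketch_tag *t;

	t = malloc(sizeof(*t));
	if (t == NULL)
		return (NULL);
	ifp->refs++;
	t->ifp = ifp;
	t->refs = 1;
	return (t);
}

/* Take an extra reference, e.g. for an mbuf that carries the tag. */
static void
sketch_tag_ref(struct sketch_tag *t)
{
	assert(t->refs > 0);
	t->refs++;
}

/*
 * Drop a reference.  On the last one, free the tag first and only then
 * drop the ifp reference.  Returns 1 if the tag was freed, else 0.
 */
static int
sketch_tag_rele(struct sketch_tag *t)
{
	struct sketch_ifp *ifp;

	assert(t->refs > 0);
	if (--t->refs > 0)
		return (0);
	ifp = t->ifp;
	free(t);
	ifp->refs--;
	return (1);
}
```

With this shape, callers never touch the ifp reference count directly, which mirrors the commit's point that code allocating send tags no longer manages that reference by hand.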


# 5f901c92 24-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# e2a8d178 18-Feb-2017 Jason A. Harmening <jah@FreeBSD.org>

Bring back r313037, with fixes for mips:

Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.

Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587


# ad62ba6e 03-Feb-2017 Jason A. Harmening <jah@FreeBSD.org>

Revert r313037

The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.

Reported by: kan, lidl


# 65ed4836 31-Jan-2017 Jason A. Harmening <jah@FreeBSD.org>

Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.


# fdf95c0b 17-Aug-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Teach netisr_get_cpuid() to limit a given value to supported by netisr.
Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value
returned by protocol to be sure that it is not greater than nws_count.

PR: 211836
Reviewed by: adrian
MFC after: 3 days
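One way to limit a protocol-supplied value to the range netisr supports is to wrap it by the workstream count before indexing. The sketch below assumes that approach; the commit message only says the value is limited, so the modulo, the `sketch_` names, and the sample array contents are all illustrative assumptions.

```c
#include <assert.h>

#define SKETCH_MAXCPU	8

/* Hypothetical table mapping workstream index to CPU ID. */
static unsigned sketch_nws_array[SKETCH_MAXCPU] = { 0, 2, 4, 6 };
static unsigned sketch_nws_count = 4;

/*
 * Map an arbitrary ID to the CPU of a configured workstream.  Any
 * value, however large, lands on an index below sketch_nws_count.
 */
static unsigned
sketch_get_cpuid(unsigned id)
{
	assert(sketch_nws_count > 0 && sketch_nws_count <= SKETCH_MAXCPU);
	return (sketch_nws_array[id % sketch_nws_count]);
}
```

For example, with four workstreams, an ID of 5 wraps to index 1 and so selects CPU 2 in this sample table.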


# 8c636a11 11-Jul-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Remove assumptions in MI code that the BSP is CPU 0.

MFC after: 2 weeks


# 484149de 03-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Introduce a per-VNET flag to enable/disable netisr processing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by: gnn (, hiren)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6691


# fdce57a0 14-May-2016 John Baldwin <jhb@FreeBSD.org>

Add an EARLY_AP_STARTUP option to start APs earlier during boot.

Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.

This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.

This last feature fixes a problem with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.

However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.

Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.

As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.

These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix


# 8dfea464 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

Remove slightly used const values that can be replaced with nitems().

Suggested by: jhb


# 2f9b9f9c 04-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Remove an unneeded check.

CPUs with valid per-CPU data are not absent.

Sponsored by: Netflix


# 8ec07310 01-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h


# a9467c3c 25-Apr-2015 Hiren Panchasara <hiren@FreeBSD.org>

Currently there is no easy way to specify net.isr.maxthreads = all cpus. We need
to specify the exact number of cpus in loader.conf, which gets annoying when you have
a mix of machines which don't have an equal number of total cpus. I propose "-1" as
that value. When loader.conf has net.isr.maxthreads = -1, netisr will use all
available cpus.

In collaboration with: davide
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D2318
MFC after: 2 weeks
Sponsored by: Limelight Networks
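The "-1 means all CPUs" convention proposed in this entry can be sketched as a small helper. Only the handling of -1 comes from the commit message; the function name and the clamping of other out-of-range values are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Resolve the thread-count tunable against the number of CPUs.
 *   -1          -> use every available CPU (from the commit message)
 *   below 1     -> fall back to a single thread (assumption)
 *   above ncpus -> clamp to ncpus (assumption)
 */
static int
sketch_resolve_maxthreads(int tunable, int ncpus)
{
	assert(ncpus >= 1);
	if (tunable == -1)
		return (ncpus);
	if (tunable < 1)
		return (1);
	if (tunable > ncpus)
		return (ncpus);
	return (tunable);
}
```

The same loader.conf line can then be deployed to machines with different CPU counts, which is the motivation given above.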


# 51d4054e 09-Apr-2015 George V. Neville-Neil <gnn@FreeBSD.org>

Revert 281276 as unnecessary. Proper change to be committed
to the base polling code in a subsequent commit.

Pointed out by: glebius

Sponsored by: Rubicon Communications (NetGate)


# 8a7ad101 08-Apr-2015 George V. Neville-Neil <gnn@FreeBSD.org>

Add support for a netisr polling tunable, which allows run time switching of
device polling rather than having it only be controlled by the compile
time option.

Summary: Rubicon Communications (Netgate)

Reviewers: #network, hiren

Reviewed By: #network, hiren

Subscribers: hiren

Differential Revision: https://reviews.freebsd.org/D2258


# c2529042 01-Dec-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# da162ca8 26-Nov-2013 Sergey Kandaurov <pluknet@FreeBSD.org>

Fix macro name in comment.


# 933e681d 06-Sep-2013 Davide Italiano <davide@FreeBSD.org>

Retire netisr.netisr_direct and netisr.netisr_direct_force sysctls.
These were used to control/export dispatch policy but they're not anymore.
This commit cannot be MFC'ed to 9 because the old netstat(1) binary relies
on these sysctls to work. On the other hand, there's no real reason to
keep'em around in 10.


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# d098f930 31-May-2011 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by: jhb
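The ordering problem described in this entry is easy to reproduce with the queue(3) macros. The sketch below uses a hypothetical `cpu_entry` type and helper names; it inserts the same entries into an SLIST at the head and into an STAILQ at the tail, then records the iteration order of each.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical per-CPU record carrying links for both list types. */
struct cpu_entry {
	int id;
	SLIST_ENTRY(cpu_entry) s_link;
	STAILQ_ENTRY(cpu_entry) t_link;
};

SLIST_HEAD(cpu_slist, cpu_entry);
STAILQ_HEAD(cpu_stailq, cpu_entry);

/* Head insertion: iteration yields the reverse of enumeration order. */
static void
order_with_slist(struct cpu_entry *cpus, size_t n, int *out)
{
	struct cpu_slist head;
	struct cpu_entry *e;
	size_t i;

	SLIST_INIT(&head);
	for (i = 0; i < n; i++)
		SLIST_INSERT_HEAD(&head, &cpus[i], s_link);
	i = 0;
	SLIST_FOREACH(e, &head, s_link)
		out[i++] = e->id;
}

/* Tail insertion: iteration preserves enumeration order. */
static void
order_with_stailq(struct cpu_entry *cpus, size_t n, int *out)
{
	struct cpu_stailq head;
	struct cpu_entry *e;
	size_t i;

	STAILQ_INIT(&head);
	for (i = 0; i < n; i++)
		STAILQ_INSERT_TAIL(&head, &cpus[i], t_link);
	i = 0;
	STAILQ_FOREACH(e, &head, t_link)
		out[i++] = e->id;
}
```

An SLIST has no tail pointer, so appending would need a full walk per insert; the STAILQ keeps one, which is why switching list types is the natural fix.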


# f2d2d694 23-May-2011 Robert Watson <rwatson@FreeBSD.org>

Rework netisr policy mechanism so that per-protocol dispatch policies can
be represented:

- A single policy namespace is defined, consisting of four possible
policies: "default" to use the global default, "deferred" to force
deferred dispatch, "direct" to employ direct dispatch where possible, and
"hybrid" which makes a dynamic decision based on CPU affinity, ordering,
etc. Routines are implemented to convert between strings and an integer
namespace.

- A new global variable, netisr_dispatch_policy, subsumes existing global
variables for direct dispatch, forced direct dispatch, etc, and is used
for explicit policy interpretation and composition. Old variables remain
so that they can be exported by legacy sysctls for use by old netstat(1)
binaries. A new sysctl and tunable, netisr.dispatch.policy, accepts the
above strings for specifying a global policy default.

- The protocol registration structure, netisr_handler, grows an nh_dispatch
field, which accepts a per-protocol policy override. The default value is
'0', which corresponds to "default", meaning that protocols will accept
the global default policy unless otherwise specified.

- Policies are now interpreted and composed explicitly at various points in
packet dispatch; protocol policies override global policies.

- Protocols grow the ability to express a non-opinion about affinity even
when implementing m2cpuid by returning NETISR_CPUID_NONE. In that case, the
framework falls back on source ordering, rather than simply using the
current CPU.

These changes are in support of allowing link layer re-dispatch based on
RSS or similar hashes provided by NICs, especially in the case where the
number of hardware receive queues matches hardware core count, rather than
hardware thread count, requiring further software redistribution (i.e.,
on RMI XLR).

MFC after: 3 weeks
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
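The policy composition described in this entry, where a protocol's policy overrides the global one unless it is "default", can be sketched as below. The enum and function names are hypothetical. Only the value 0 for "default" is stated in the commit message; the numbering of the other policies is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical integer namespace; only DEFAULT == 0 is from the text. */
enum sketch_dispatch {
	SKETCH_DEFAULT = 0,
	SKETCH_DEFERRED,
	SKETCH_HYBRID,
	SKETCH_DIRECT
};

/* Convert a policy value to its string form, or NULL if unknown. */
static const char *
sketch_dispatch_name(enum sketch_dispatch d)
{
	switch (d) {
	case SKETCH_DEFAULT:	return ("default");
	case SKETCH_DEFERRED:	return ("deferred");
	case SKETCH_HYBRID:	return ("hybrid");
	case SKETCH_DIRECT:	return ("direct");
	}
	return (NULL);
}

/*
 * Compose the global and per-protocol policies.  The global setting
 * must be a concrete policy; a protocol that leaves its field at 0
 * simply inherits it.
 */
static enum sketch_dispatch
sketch_effective(enum sketch_dispatch global, enum sketch_dispatch proto)
{
	assert(global != SKETCH_DEFAULT);
	return (proto != SKETCH_DEFAULT ? proto : global);
}
```

Making "default" the zero value means a zero-initialised registration structure inherits the global policy without any extra code in the protocol.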


# 0028e524 11-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177255:

Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS.

Change the syntax to match KASSERT() to allow more flexible panic
messages rather than having a printf with hardcoded arguments
before panic.

Adjust the few assertions we have to the new format (and enhance
the output).

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb

MFC after: 2 weeks


# f88910cd 12-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the net* piece.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 3aa6d94e 11-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Update several places that iterate over CPUs to use CPU_FOREACH().


# 25d3931a 31-May-2010 Robert Watson <rwatson@FreeBSD.org>

Merge r200899 from head to stable/8:

When warning about possible netisr configuration problems during boot,
report using "netisr_init" rather than "netisr2", which was the development
name for the project.

Approved by: re (kib)


# 938448cd 28-Feb-2010 Robert Watson <rwatson@FreeBSD.org>

Changes to support crashdump analysis of netisr:

- Rename the netisr protocol registration array, 'np' to 'netisr_proto',
in order to reduce the chances of symbol name collisions. It remains
statically defined, but it will be looked up by netstat(1).

- Move certain internal structure definitions from netisr.c to
netisr_internal.h so that netstat(1) can find them. They remain
private, and should not be used for any other purpose (for example,
they should not be used by kernel modules, which must instead use the
public interfaces in netisr.h).

- Store a kernel-compiled version of NETISR_MAXPROT in the global variable
netisr_maxprot, and export via a sysctl, so that it is available for use
by netstat(1). This is especially important for crashdump
interpretation, where the size of the workstream structure is determined
by the maximum number of protocols compiled into the kernel.

MFC after: 1 week
Sponsored by: Juniper Networks


# 7f450feb 25-Feb-2010 Robert Watson <rwatson@FreeBSD.org>

Fix edge cases in several KASSERTs: use <= rather than < when testing that
counters have not gone above MAXCPU or NETISR_MAXPROT. These problems
caused panics on UP kernels with INVARIANTS when using sysctl -a, but
would also have caused problems for 32-core boxes or if the netisr
protocol vector was fully populated.

Reported by: nwhitehorn, Neel Natu <neelnatu@gmail.com>
MFC after: 4 days
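The off-by-one described in this entry can be shown with a small userland sketch, using assert() in place of KASSERT. The constant and function names are hypothetical; the limit is set to 1 to mimic a uniprocessor build.

```c
#include <assert.h>

#define SKETCH_MAXCPU	1	/* mimic a uniprocessor (UP) build */

/*
 * Count the CPUs marked present.  When every slot is in use, the
 * counter legitimately equals SKETCH_MAXCPU, so the check must be
 * "<=".  Writing "counter < SKETCH_MAXCPU" here would fire on a
 * perfectly valid state.
 */
static unsigned
sketch_count_cpus(const int *present, unsigned n)
{
	unsigned counter, i;

	counter = 0;
	for (i = 0; i < n; i++) {
		if (present[i])
			counter++;
	}
	assert(counter <= SKETCH_MAXCPU);
	return (counter);
}
```

The same reasoning applies to any count of filled slots: the count may reach the capacity, while an index into the array may not.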


# 2d22f334 22-Feb-2010 Robert Watson <rwatson@FreeBSD.org>

Export netisr configuration and statistics to userspace via sysctl(9).

MFC after: 1 week
Sponsored by: Juniper Networks


# 78494902 15-Feb-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Mark various sysctls also as tunables.

Reviewed by: rwatson
MFC after: 1 week


# 912f6323 22-Dec-2009 Robert Watson <rwatson@FreeBSD.org>

When warning about possible netisr configuration problems during boot,
report using "netisr_init" rather than "netisr2", which was the development
name for the project.

MFC after: 3 days


# 0a32e29f 22-Dec-2009 Robert Watson <rwatson@FreeBSD.org>

Refine netisr.c comments a bit.


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# ba3b25b3 29-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

In case we cannot queue a packet reaching the queue limit, retain the
semantics netisr_queue() always had and free the mbuf along with
returning the error.

Reviewed by: rwatson
Approved by: re (kensmith)
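The ownership rule described in this entry, where the queueing function consumes the packet whether or not it could be queued, can be sketched as below. The types and names are hypothetical, and returning ENOBUFS for a full queue is an assumption; the commit message only says that an error is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical packet and bounded FIFO types. */
struct sketch_pkt {
	struct sketch_pkt *next;
};

struct sketch_queue {
	struct sketch_pkt *head, *tail;
	unsigned len, limit, drops;
};

/*
 * Enqueue p.  The queue takes ownership in every case: on success the
 * packet is linked in, and at the limit it is freed before the error
 * is returned, so the caller must not touch p afterwards.
 */
static int
sketch_queue(struct sketch_queue *q, struct sketch_pkt *p)
{
	assert(q != NULL && p != NULL);
	if (q->len >= q->limit) {
		q->drops++;
		free(p);
		return (ENOBUFS);
	}
	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->len++;
	return (0);
}
```

Keeping the free inside the queueing function means callers have a single rule to follow and cannot leak or double free a packet on the error path.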


# 9e6e01eb 26-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

In light of DPCPU use by netisr, revise various for loops from using
MAXCPU to mp_maxid, and handling and reporting of requests to use more
threads than we have CPUs to run them on.

Reviewed by: bz
Approved by: re (kib)
MFC after: 6 weeks


# 53402767 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Convert netisr to use dynamic per-CPU storage (DPCPU) instead of sizing
arrays to [MAXCPU], offering moderate memory savings. In some places,
this requires using CPU_ABSENT() to handle less common platforms with
sparse CPU IDs. In several places, assert that the selected CPUID for
work placement or statistics is not CPU_ABSENT() to be on the safe side.

Discussed with: bz, jeff


# ed655c8c 14-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Add an optional callback function that will be invoked when a per-CPU
queue was drained. It will never fire for a directly dispatched packet.

You will most likely never want to use this for any ordinary netisr usage
and you will never blame netisr in case you try to use it and it does
not work as expected.

Reviewed by: rwatson


# d363c617 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Revert a recent netisr2 change: when billing packets to the current
CPU, don't lock the workstream, as its mutexes may not have been
initialized if there are fewer workstreams than CPUs.

Run into by: hps, ps


# ed54411c 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Garbage collect NETISR_POLL and NETISR_POLLMORE, which are no longer
required for options DEVICE_POLLING.

De-fragment the NETISR_ constant space and lower NETISR_MAXPROT from
32 to 16 -- when sizing queue arrays using this compile-time constant,
significant amounts of memory are saved.

Warn on the console when tunable values for netisr are automatically
adjusted during boot due to exceeding limits, invalid values, or as a
result of DEVICE_POLLING.


# d4b5cae4 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processing its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.

In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
currently be one of:

NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).

NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifiers (m2flow). Falls back on
NETISR_POLICY_SOURCE if no flow ID is available.

NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.

- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.

- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.
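
The three ordering policies listed above can be illustrated with a minimal userspace sketch. Every name here (`pol_policy`, `pol_pkt`, `pol_select_ws`) is hypothetical and not part of the netisr KPI, and the modulo mapping onto workstreams is an assumption made only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three ordering policies. */
enum pol_policy {
	POL_SOURCE,
	POL_FLOW,
	POL_CPU
};

/* Hypothetical packet summary; the real framework operates on mbufs. */
struct pol_pkt {
	uintptr_t source;	/* implicit or explicit source, e.g. an interface */
	int	  has_flowid;	/* nonzero if a flow identifier is available */
	uint32_t  flowid;	/* flow identifier, valid if has_flowid is set */
	unsigned  cpu_hint;	/* CPU picked by the protocol (m2cpuid analogue) */
};

/*
 * Pick a workstream index in [0, nws).  The modulo mapping is an
 * assumption of this sketch, not the kernel's actual placement logic.
 */
unsigned
pol_select_ws(enum pol_policy policy, const struct pol_pkt *p, unsigned nws)
{
	assert(nws > 0);
	if (policy == POL_CPU)
		return (p->cpu_hint % nws);
	if (policy == POL_FLOW && p->has_flowid)
		return (p->flowid % nws);
	/* SOURCE policy, or FLOW without a flow ID: order by source. */
	return ((unsigned)(p->source % nws));
}
```

The point of the sketch is the fallback: a flow-policy packet without a flow ID is placed exactly as a source-policy packet would be.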

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.
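
As a rough illustration of why batched dequeue helps, the sketch below detaches a whole queue under one simulated lock acquisition instead of one per packet. The types and the `lock_ops` counter are hypothetical; the kernel uses mbufs and real mutexes, and its dequeue logic may differ:

```c
#include <stddef.h>

/* Hypothetical packet and queue types for this sketch only. */
struct bq_pkt {
	struct bq_pkt *next;
	int	       id;
};

struct bq_queue {
	struct bq_pkt *head;
	struct bq_pkt *tail;
	unsigned       lock_ops;	/* counts simulated lock acquisitions */
};

/* Append one packet; a lock would be taken once per call. */
void
bq_enqueue(struct bq_queue *q, struct bq_pkt *p)
{
	q->lock_ops++;
	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/*
 * Batched dequeue: detach the entire list under a single (simulated)
 * lock acquisition so the caller can process it without the lock.
 */
struct bq_pkt *
bq_dequeue_all(struct bq_queue *q)
{
	struct bq_pkt *batch;

	q->lock_ops++;
	batch = q->head;
	q->head = q->tail = NULL;
	return (batch);
}

/* Walk a detached batch in FIFO order, recording up to max packet ids. */
unsigned
bq_drain(struct bq_pkt *batch, int *order, unsigned max)
{
	unsigned n = 0;

	for (; batch != NULL && n < max; batch = batch->next)
		order[n++] = batch->id;
	return (n);
}
```

With N packets queued, the consumer side touches the lock once rather than N times, which is the effect the commit message attributes the improvement to.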

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by: bz


# 2f120c90 13-May-2009 Robert Watson <rwatson@FreeBSD.org>

Garbage collect now-unused NETISR_FORCEQUEUE, which overrode the global
direct dispatch policy for specific protocols (NETISR_USB). We leave
the additional 'flags' argument to netisr_register() for the time being,
even though it is no longer required.


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The curvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discouraged.
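
The set/restore discipline can be sketched in userspace with a thread-local pointer. The `SK_` macros and `sk_` names below are hypothetical stand-ins, not the kernel's CURVNET_SET() / CURVNET_RESTORE() definitions, and the sketch assumes a C11 compiler for `_Thread_local`:

```c
#include <stddef.h>

/* Hypothetical vnet type; only the identity of the pointer matters. */
struct sk_vnet {
	const char *name;
};

/* Thread-local "current vnet", standing in for curthread->td_vnet. */
static _Thread_local struct sk_vnet *sk_curvnet;

/* Save the previous context in a local, then install the new one. */
#define	SK_CURVNET_SET(arg)						\
	struct sk_vnet *sk_saved_vnet = sk_curvnet;			\
	sk_curvnet = (arg)
/* Put back whatever context was current before the matching SET. */
#define	SK_CURVNET_RESTORE()						\
	sk_curvnet = sk_saved_vnet

/* Example entry point into "networking code" for a given vnet. */
struct sk_vnet *
sk_input(struct sk_vnet *vnet)
{
	struct sk_vnet *seen;

	SK_CURVNET_SET(vnet);
	seen = sk_curvnet;	/* code here runs in the context of vnet */
	SK_CURVNET_RESTORE();
	return (seen);
}

/* Returns nonzero if the outer context is intact after a nested set. */
int
sk_nested(struct sk_vnet *outer, struct sk_vnet *inner)
{
	int ok;

	SK_CURVNET_SET(outer);
	(void)sk_input(inner);	/* recursion on the current vnet */
	ok = (sk_curvnet == outer);
	SK_CURVNET_RESTORE();
	return (ok);
}
```

Because each SET saves into a function-local variable, nested use unwinds correctly, which is why recursion is permitted even if discouraged.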

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 59dd72d0 03-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acquisition of Giant and any calling
context.
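
A minimal sketch of the dispatch decision this describes, assuming a single global direct-dispatch switch and a per-handler flag. `FQ_FORCEQUEUE`, `fq_path` and `fq_choose` are hypothetical names, not the kernel's definitions:

```c
/* Hypothetical per-handler flag: always deliver via the queue. */
#define	FQ_FORCEQUEUE	0x1

enum fq_path {
	FQ_DIRECT,	/* run the handler in the caller's context */
	FQ_QUEUED	/* defer to the netisr thread */
};

/*
 * Direct dispatch is used only if it is globally enabled and the
 * handler has not asked to always be queued.
 */
enum fq_path
fq_choose(int direct_enabled, int handler_flags)
{
	if (direct_enabled && (handler_flags & FQ_FORCEQUEUE) == 0)
		return (FQ_DIRECT);
	return (FQ_QUEUED);
}
```

A handler that acquires Giant itself would set the flag so that it never runs in a caller's locking context.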

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 0bf686c1 06-Aug-2007 Robert Watson <rwatson@FreeBSD.org>

Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminating
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)


# 33d2bb9c 27-Jul-2007 Robert Watson <rwatson@FreeBSD.org>

First in a series of changes to remove the now-unused Giant compatibility
framework for non-MPSAFE network protocols:

- Remove debug_mpsafenet variable, sysctl, and tunable.
- Remove NET_NEEDS_GIANT() and associated SYSINITs used by it to force
debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel.
- Remove logic to automatically flag interrupt handlers as non-MPSAFE if
debug.mpsafenet is set for an INTR_TYPE_NET handler.
- Remove logic to automatically flag netisr handlers as non-MPSAFE if
debug.mpsafenet is set.
- Remove references in a few subsystems, including NFS and Cronyx drivers,
which keyed off debug_mpsafenet to determine various aspects of their own
locking behavior.
- Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT() into
no-ops, as their entire behavior was determined by the value in
debug_mpsafenet.
- Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE.

Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still
present in subsystems, and will be removed in followup commits.

Reviewed by: bz, jhb
Approved by: re (kensmith)


# 1f87450e 28-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Change net.isr.direct from defaulting to 0 to 1 in 7-CURRENT. This
enables direct dispatch of the network stack from the device driver
ithread, enabling input path parallelism by default when multiple
interfaces are present.

The strategy for network stack parallelism is something being actively
discussed, and this is just one of several possible (and perfectly
reasonable) strategies, but has the distinct advantage of reducing the
number of context switches and preemptions significantly, resulting in
higher efficiency in many cases. In some cases, this may reduce
network stack parallelism due to work not being deferred from the
ithread to the netisr. Therefore, the strategy may change in the
future, but this offers a reasonable first pass at enabling
parallelism while maintaining strong ordering.

Hopefully this will trigger lots of nice new bugs.

This change is not intended for MFC.


# f0796cd2 05-Oct-2005 Gleb Smirnoff <glebius@FreeBSD.org>

- Don't pollute opt_global.h with DEVICE_POLLING and introduce
opt_device_polling.h
- Include opt_device_polling.h into appropriate files.
- Embrace with HAVE_KERNEL_OPTION_HEADERS the include in the files that
can be compiled as loadable modules.

Reviewed by: bde


# cea2165b 04-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Rename net.isr.enable to net.isr.dispatch.

No compatibility code is provided, as this will be the production name
as of 6.0.

MFC after: 3 days
Requested by: scottl


# de10fe70 11-Oct-2004 Andre Oppermann <andre@FreeBSD.org>

Correctly unregister a netisr by clearing the ni->ni_queue field to NULL as
well. This field is actually used by various netisr functions to determine
the availability of the specified netisr. This incomplete unregister leads
directly to a crash when the KLD unregistering the netisr is unloaded.
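
The failure mode can be sketched as follows, assuming availability is tested by checking the queue pointer as the message states. The `ur_` types are hypothetical and only loosely modelled on the ni_queue field:

```c
#include <stddef.h>

/* Hypothetical queue owned by the registering module. */
struct ur_queue {
	int len;
};

/* Hypothetical registration slot for one netisr. */
struct ur_isr {
	void		(*handler)(void);
	struct ur_queue	*queue;
};

/* Clear both fields so no stale pointer into an unloaded module remains. */
void
ur_unregister(struct ur_isr *ni)
{
	ni->handler = NULL;
	ni->queue = NULL;	/* the step the fix adds */
}

/* Availability check keyed on the queue pointer. */
int
ur_available(const struct ur_isr *ni)
{
	return (ni->queue != NULL);
}
```

If only the handler were cleared, `ur_available()` would still report the slot as usable and later code would follow a pointer into memory that no longer belongs to a loaded module.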

Submitted by: Sam <sah@softcardsystems.com>
MFC after: 3 days


# ccaae37a 02-Sep-2004 Robert Watson <rwatson@FreeBSD.org>

Correct a comment typo: s/Note/Not/.

Pointed out by: kensmith


# ace437c3 28-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Correct typo in printf() warning.

Submitted by: Pawel Worach <pawel.worach at telia.com>


# 1d8cd39e 28-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Change the default disposition of debug.mpsafenet from 0 to 1, which
will cause the network stack to operate without the Giant lock by
default. This change has the potential to improve performance by
increasing parallelism and decreasing latency in network processing.

Due to the potential exposure of existing or new bugs, the following
compatibility functionality is maintained:

- It is still possible to disable Giant-free operation by setting
debug.mpsafenet to 0 in loader.conf.

- Add "options NET_WITH_GIANT", which will restore the default value of
debug.mpsafenet to 0, and is intended for use on systems compiled with
known unsafe components, or where a more conservative configuration is
desired.

- Add a new declaration, NET_NEEDS_GIANT("componentname"), which permits
kernel components to declare dependence on Giant over the network
stack. If the declaration is made by a preloaded module or a compiled
in component, the disposition of debug.mpsafenet will be set to 0 and
a warning concerning performance degraded operation printed to the
console. If it is declared by a loadable kernel module after boot, a
warning is displayed but the disposition cannot be changed. This is
implemented by defining a new SYSINIT() value, SI_SUB_SETTINGS, which
is intended for the processing of configuration choices after tunables
are read in and the console is available to generate errors, but
before much else gets going.

This compatibility behavior will go away when we've finished the last
of the locking work and are confident that operation is correct.


# 3161f583 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure. At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success. Due to this
schednetisr() was never called to kick the scheduling of the isr. However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.
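
The return convention described above might look like this in a self-contained sketch. `nq_queue` and `nq_enqueue` are hypothetical; a NULL queue models a netisr that is not functional, and the `kicks` counter stands in for the schednetisr() call:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical bounded queue for one netisr. */
struct nq_queue {
	int len;
	int maxlen;
	int kicks;	/* times the isr was scheduled (schednetisr analogue) */
};

/*
 * Return 0 on success and an errno value on failure: ENXIO if the
 * queue is not functional, ENOBUFS if it is full.
 */
int
nq_enqueue(struct nq_queue *q)
{
	if (q == NULL)
		return (ENXIO);
	if (q->len >= q->maxlen)
		return (ENOBUFS);
	q->len++;
	q->kicks++;	/* on success, schedule the isr to drain the queue */
	return (0);
}
```

Callers can then treat any nonzero return as an error, which removes the ambiguity that previously caused the scheduling kick to be skipped.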

PR: kern/70988
Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 3 days


# 08f85b08 18-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Comment clarifying debug_mpsafenet.


# 7902224c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by: FreeBSD Foundation


# d3be1471 05-Nov-2003 Sam Leffler <sam@FreeBSD.org>

o make debug_mpsafenet globally visible
o move it from subr_bus.c to netisr.c where it more properly belongs
o add NET_PICKUP_GIANT and NET_DROP_GIANT macros that will be used to
grab Giant as needed when MPSAFE operation is enabled

Supported by: FreeBSD Foundation


# 5fd04e38 03-Oct-2003 Robert Watson <rwatson@FreeBSD.org>

When direct dispatching a netisr (net.isr.enable=1), if there are already
any queued packets for the isr, process those packets before the newly
submitted packet, maintaining ordering of all packets being delivered
to the netisr. Remove the bypass counter since we don't bypass anymore.
Leave the comment about possible problems and options since later
performance optimization may change the strategy for addressing ordering
problems here.
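
A minimal sketch of the ordering rule, using integers in place of packets. The `dd_` names, the fixed-size FIFO and the delivery log are all assumptions made for illustration, and no locking is modelled:

```c
#define	DD_MAXQ	8
#define	DD_LOG	32

/* Hypothetical per-isr state: a small FIFO plus a delivery log. */
struct dd_isr {
	int queue[DD_MAXQ];
	int qlen;
	int delivered[DD_LOG];
	int ndelivered;
};

/* Stand-in for the protocol handler: record the delivery order. */
static void
dd_handle(struct dd_isr *isr, int pkt)
{
	if (isr->ndelivered < DD_LOG)
		isr->delivered[isr->ndelivered++] = pkt;
}

/* Deferred path: append to the FIFO; returns 0 on success, -1 if full. */
int
dd_queue(struct dd_isr *isr, int pkt)
{
	if (isr->qlen >= DD_MAXQ)
		return (-1);
	isr->queue[isr->qlen++] = pkt;
	return (0);
}

/*
 * Direct dispatch: deliver anything already queued first, in order,
 * and only then the newly submitted packet.
 */
void
dd_dispatch(struct dd_isr *isr, int pkt)
{
	int i;

	for (i = 0; i < isr->qlen; i++)
		dd_handle(isr, isr->queue[i]);
	isr->qlen = 0;
	dd_handle(isr, pkt);
}
```

Draining before delivering is what keeps a directly dispatched packet from overtaking packets that were queued earlier.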

Specifically, this maintains the strong isr ordering guarantee; additional
parallelism and lower latency may be possible by moving to weaker
guarantees (per-interface, for example). We will probably at some point
also want to remove the one instance netisr dispatch limit currently
enforced by a mutex, but it's not clear that's 100% safe yet, even in
the netperf branch.

Reviewed by: sam, others


# e590eca2 01-Oct-2003 Robert Watson <rwatson@FreeBSD.org>

Create a tunable for net.isr.enable so that it may be set from
inception, rather than having to wait for the boot to finish.


# 3164565d 01-Oct-2003 Robert Watson <rwatson@FreeBSD.org>

Temporarily turn net.isr.enable back off again until patches to
correct potential nits in packet ordering are resolved.


# 19288f73 01-Oct-2003 Robert Watson <rwatson@FreeBSD.org>

Enable net.isr.enable by default, causing "delivery to completion"
(direct dispatch) in interrupt threads when the netisr in question
isn't already active. If a netisr is already active, or direct
dispatch is already in progress, we queue the packet for later
delivery. Previously, this option was disabled by default. I have
measured 20%+ performance improvements in IP packet forwarding with
this enabled.

Please report any problems ASAP, especially relating to stack depth or
out-of-order packet processing.

Discussed with: jlemon, peter
Sponsored by: DARPA, Network Associates Laboratories


# fb68148f 08-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Discard the packet if the netisr queue is null instead of panicking, for
the benefit of modules which are compiled differently than the kernel.


# 1cafed39 04-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Update netisr handling; Each SWI now registers its queue, and all queue
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.

Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs


# e3b6e33c 21-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Moved netisr code from kern/kern_intr.c to net/netisr.c as threatened in a
comment.