History log of /freebsd-current/sys/kern/uipc_socket.c
Revision Date Author Comments
# 99b0270a 06-May-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: hide socket hhook(9)s under SOCKET_HHOOK

There are no in-tree consumers of these hooks.

Reviewed by: stevek
Differential Revision: https://reviews.freebsd.org/D44928


# a8acc2bf 23-Apr-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: inherit SO_ACCEPTFILTER from listener to child

This is crucial for operation of accept_filter(9). See added comment.

Fixes: d29b95ecc0d049406d27a6c11939d40a46658733


# 81b4d1c4 08-Apr-2024 Stephen J. Kiernan <stevek@FreeBSD.org>

sockets: Add hhook in sonewconn for inheriting OSD specific data

Added HHOOK_SOCKET_NEWCONN and bumped HHOOK_SOCKET_LAST

Reviewed by: glebius, tuexen
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44632


# aa32d7cb 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record socket violations with KTR_CAPFAIL

Report restricted access to socket addresses and protocols while
Capsicum violation tracing with CAPFAIL_ADDR and CAPFAIL_PROTO.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40681


# 681711b7 05-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

uipc_socket: handle socket buffer locks in sopeeloff

PR: 278171
Reviewed by: markj
Fixes: a4fc41423f7d ("sockets: enable protocol specific socket buffers")
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D44640


# 15bfd7cf 22-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

soreceive_dgram: use M_WAITOK when we don't hold any locks


# 26389b30 22-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

soreceive_dgram: assert that a datagram has control or data


# d62c4607 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: remove unused KPIs to manipulate sockets

These KPIs were added in dd0e6c383a9f0 and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet. But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet. For the sockets it is unclear what could be a large
collection of externally written kernel modules that need extensively use
sockets and not be aware of their internals at the same time. This
isolation of a structure knowledge requires a lot of work, and just
throwing in a few KPIs isn't helpful.

Reviewed by: kib, olce, markj
Differential Revision: https://reviews.freebsd.org/D44311


# 7ee47c3b 28-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: in solisten_proto() don't call sbdestroy() on a PR_SOCKBUF

A socket marked with PR_SOCKBUF has protocol specific socket buffers
and will take care of the in its pr_listen method. Right now we don't
have any sockets that can listen and are PR_SOCKBUF, but that will
change soon.


# ce69e373 03-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

Revert "sockets: retire sorflush()"

Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.


# f79a8585 30-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: garbage collect SS_ISCONFIRMING

Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e


# 507f87a7 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: retire sorflush()

With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2). The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety. This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over. I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43415


# 289bee16 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: remove dom_dispose and PR_RIGHTS

Passing file descriptors (rights) via sockets is a feature specific to
PF_UNIX only, so fully isolate the logic into uipc_usrreq.c.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43414


# 5bba2728 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make pr_shutdown fully protocol specific method

Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413


# c3276e02 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make shutdown(2) how argument a enum

Reviwed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43412


# 59ce044a 08-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: on shutdown(2) do sorflush() only in case of generic sockbuf

This is a quick plug to fix panic with Netlink which has protocol specific
buffers. Note that PF_UNIX/SOCK_DGRAM, which also has its own buffers,
avoids the panic due to being SOCK_DGRAM. A correct but more complicated
fix that needs to be done is to merge pr_shutdown, pr_flush and dom_dispose
into one protocol method that may call sorflush for generic sockets or do
their own stuff for protocol which has own buffers.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43367
Reported-by: syzbot+a58e1615881c01a51653@syzkaller.appspotmail.com


# 0fac350c 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on getpeername/getsockname

Just like it was done for accept(2) in cfb1e92912b4, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.

Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# f64a688d 13-Nov-2023 Brooks Davis <brooks@FreeBSD.org>

Remove gratuitous copyouts of unchanged struct mac.

The get operations change the data pointed to by the structure, but do
not update the contents of the struct.

Mark the struct mac arguments of mac_[gs]etsockopt_*label() and
mac_check_structmac_consistent() const to prevent this from changing
in the future.

Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14488


# 978be1ee 09-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

sockets: Add sysctl flag CTLFLAG_TUN to loader tunable

The sysctl variable 'kern.ipc.maxsockets' is actually a loader tunable.
Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it
correctly.

No functional change intended.

Reviewed by: kib, imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42113


# f4410241 09-Sep-2023 Greg Becker <becker.greg@att.net>

sockets: re-check socket state after call to pr_rcvd()

Socket state may have changed after dropping the receive
buffer lock in order to call pr_rcvd(). If the buffer is
empty, re-check the state after reaquiring the lock and
skip calling sbwait() if the socket is in error or the
peer has closed.

PR: 212716
Reviewed by: markj, glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D41783


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# d29b95ec 14-Aug-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: on accept(2) don't copy all of so_options to new socket

As uncovered by e3ba0d6adde3 we are copying lots of irrelevant options
from the listener to an accepted socket, even those that aren't relevant
to a non-listener, e.g. SO_REUSE*, SO_ACCEPTFILTER. Stop doing that
and provide a fixed opt-in list for options to be inherited. Ideally
we shall not inherit anything at all. For compatibility inherit a set
of options that are meaningful for a non-listening socket of a protocol
that can listen(2).

Differential Revision: https://reviews.freebsd.org/D41412
Fixes: e3ba0d6adde3c694f46a30b3b67eba43a7099395


# 4824d788 30-Apr-2023 Eugene Grosbein <eugen@FreeBSD.org>

listen(2): improve administrator control over logging

As documented in listen.2 manual page, the kernel emits a LOG_DEBUG
syslog message if a socket listen queue overflows. For some appliances,
it may be desirable to change the priority to some higher value
like LOG_INFO while keeping other debugging suppressed.

OTOH there are cases when such overflows are normal and expected.
Then it may be desirable to suppress overflow logging altogether,
so that dmesg buffer is not flooded over long run.

In addition to existing sysctl kern.ipc.sooverinterval,
introduce new sysctl kern.ipc.sooverprio that defaults to 7 (LOG_DEBUG)
to preserve current behavior. It may be changed to any value
in a range of 0..7 for corresponding priority or to -1 to suppress logging.
Document it in the listen.2 manual page.

MFC after: 1 month


# b4b33821 21-Mar-2023 Mark Johnston <markj@FreeBSD.org>

ktls: Fix interlocking between ktls_enable_rx() and listen(2)

The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is listening socket and fail if so, but this check is
racy. Since we have to lock the socket buffer later anyway, defer the
check to that point.

ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed. In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.

Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.

Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.

Add some simple regression tests involving listen(2).

Reported by: syzkaller
MFC after: 2 weeks
Reviewed by: gallatin, glebius, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38504


# 636b19ea 14-Feb-2023 Mark Johnston <markj@FreeBSD.org>

tcp: Disallow re-connection of a connected socket

soconnectat() tries to ensure that one cannot connect a connected
socket. However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.

Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().

Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38507


# a0102dee 01-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: in sousrsend() pass down the error to aio(4)

This somewhat undermines the initial goal of sousrsend() to have all
the special error handling for a write on a socket in a single place.
The aio(4) needs to see EWOULDBLOCK to re-schedule the job. Because
aio(4) handles return from soreceive() and sousrsend() with the same
code, we can't check for (error == 0 && done < job_nbytes). Keeping
this exclusion for aio(4) seems a lesser evil.

Fixes: 7a2c93b86ef75390a60a4b4d6e3911b36221dfbe


# 7a2c93b8 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: provide sousrsend() that does socket specific error handling

Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31e).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().

PR: 265087
Reported by: rz-rpi03 at h-ka.de
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35863


# ebdf27b6 10-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

uipc: remove accept_mtx

It is unused since 779f106aa169256b ("Listening sockets improvements.")

Sponsored by: Rubicon Communications, LLC ("Netgate")


# bc0d4076 11-Oct-2022 Michael Tuexen <tuexen@FreeBSD.org>

Revert "listen(): improve POSIX compliance"

This reverts commit 76e6e4d72f8d3da7d19242f303bc95461fde7fb9.

Several programs in the tree use -1 instead of INT_MAX to use
the maximum value. Thanks to Eugene Grosbein for pointing this
out.


# 76e6e4d7 11-Oct-2022 Michael Tuexen <tuexen@FreeBSD.org>

listen(): improve POSIX compliance

Ensure that a negative backlog argument is handled as it if was 0.

Reviewed by: markj@, glebius@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31821


# f6696856 27-Sep-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

protocols: make socket buffers ioctl handler changeable

Allow to set custom per-protocol handlers for the socket buffers
ioctls by introducing pr_setsbopt callback with the default value
set to the currently-used sbsetopt().

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36746


# e80062a2 08-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: avoid call to soisconnected() on transition to ESTABLISHED

This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d815 it definitely
became necessary always and commit message from f1ee30ccd60 confirms that.
Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488


# 24af7808 30-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: repair protocol selection logic in socket(2)

Pointy hat to: glebius
Fixes: 61f7427f02a307d28af674a12c45dd546e3898e4


# 61f7427f 30-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: cleanup protocols that existed merely to provide pr_input

Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6). This story ended with 78b1fc05b20.

These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.

With 78b1fc05b20 all these entries are no longer needed, as ip_protox
is independent of protosw. Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw. And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.

For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
of protosw value, which is now always 0.

Differential revision: https://reviews.freebsd.org/D36380


# 8624f434 30-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

divert: declare PF_DIVERT domain and stop abusing PF_INET

The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.

Differential revision: https://reviews.freebsd.org/D36379


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# f277746e 12-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: change prototype for pr_control

For some reason protosw.h is used during world complation and userland
is not aware of caddr_t, a relic from the first version of C. Broken
buildworld is good reason to get rid of yet another caddr_t in kernel.

Fixes: 886fc1e80490fb03e72e306774766cbb2c733ac6


# 07285bb4 10-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: utilize new solisten_clone() and solisten_enqueue()

This streamlines cloning of a socket from a listener. Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.

Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb. And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected(). This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36064


# 8f5a0a2e 10-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: provide solisten_clone(), solisten_enqueue()

as alternative KPI to sonewconn(). The latter has three stages:
- check the listening socket queue limits
- allocate a new socket
- call into protocol attach method
- link the new socket into the listen queue of the listening socket

The attach method, originally designed for a creation of socket by the
socket(2) syscall has slightly different semantics than attach of a socket
cloned by listener. Make it possible for protocols to call into the
first stage, then perform a different attach, and then call into the
final stage. The first stage, that checks limits and clones a socket
is called solisten_clone(), and the function that enqueues the socket
is solisten_enqueue().

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36063


# be1f485d 25-Jul-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().

Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.

This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).

PR: kern/176322
Reviewed by: pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after: 1 month


# c261510e 08-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket

MFC after: 3 weeks


# d8596171 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use only soref()/sorele() as socket reference count

o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision: https://reviews.freebsd.org/D35679


# bc760564 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use positive flag for file descriptor socket reference

Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.

This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D35678


# 66c8e3fc 30-Jun-2022 Gleb Smirnoff <glebius@FreeBSD.org>

socket: fix listen(2) on an already listening socket

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35669
Fixes: 141fe2dceeaeefaaffc2242c8652345a081e825a


# a4fc4142 24-Jun-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: enable protocol specific socket buffers

Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want. Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35299


# f6379f7f 16-Jun-2022 Mark Johnston <markj@FreeBSD.org>

socket: Fix a race between kevent(2) and listen(2)

When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.

If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.

Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35492


# a8e286bb 03-Jun-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use socket buffer mutexes in struct socket directly

Convert more generic socket code to not use sockbuf compat pointer.
Continuation of 4328318445a.


# 37351133 14-May-2022 Rick Macklem <rmacklem@FreeBSD.org>

uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records

Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by: hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35170


# 43283184 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use socket buffer mutexes in struct socket directly

Since c67f3b8b78e the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b50 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152


# 2e4e5ee2 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: delete stale comment from sofree()

First paragraph refers to old past "we used to" and is no longer
important today. Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.


# a982ce04 09-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: remove the socket-on-stack hack from sorflush()

The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary. Why can't we call into dom_dispose under
splimp()? Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9b allow us to get
rid of the hack.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35125


# 42f2fa99 09-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't call dom_dispose() on a listening socket

sorflush() already did the right thing, so only sofree() needed
a fix. Turn check into assertion in our only dom_dispose method.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35124


# c17418a0 09-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: assert that any protocol with PR_RIGHTS has dom_dispose()

Through the entire history only PF_UNIX has this feature.

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35123


# 97f8198e 09-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make SO_SND/SO_RCV a enum

Not a functional change now. The enum will also be used for other socket
buffer related KPIs.


# aeb91e95 26-Mar-2022 Alexander Leidinger <netchild@FreeBSD.org>

Log euid, rgid and jail on listen queue overflow

If you have numerous jails with multiple similar services running,
this helps to narrow down which services this log is referring to.


# 5de79eed 07-Feb-2022 Mark Johnston <markj@FreeBSD.org>

ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode

There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen. Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.

Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.

Add regression tests to exercise this case.

Reported by: syzkaller
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34195


# fe27f1db 25-Dec-2021 Alexander Motin <mav@FreeBSD.org>

kern: Remove CTLFLAG_NEEDGIANT from some sysctls.

MFC after: 2 weeks


# f9978339 20-Dec-2021 Hans Petter Selasky <hselasky@FreeBSD.org>

Remove dead code.

The variable orig_resid is always set to zero right after the while loop
where it is cleared.

Reviewed by: gallatin@ and glebius@
Differential Revision: https://reviews.freebsd.org/D33589
MFC after: 1 week
Sponsored by: NVIDIA Networking


# e3ba94d4 09-Nov-2021 John Baldwin <jhb@FreeBSD.org>

Don't require the socket lock for sorele().

Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741


# c441592a 03-Nov-2021 Allan Jude <allanjude@FreeBSD.org>

Allow kern.ipc.maxsockets to be set to current value without error

Normally setting kern.ipc.maxsockets returns EINVAL if the new value
is not greater than the previous value. This can cause spurious
error messages when sysctl.conf is processed multiple times, or when
automation systems try to ensure the sysctl is set to the correct
value. If the value is unchanged, then just do nothing.

PR: 243532
Reviewed by: markj
MFC after: 3 days
Sponsored by: Modirum MDPay
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D32775


# a37e4fd1 01-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Re-style dfcef8771484 to keep the code and variables related to
listening sockets separated from code for generic sockets.

No objection: markj


# ade1daa5 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Synchronize soshutdown() with listen(2) and AIO

To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references. Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks. Fields in the original buffer are
cleared.

This behaviour clobbered the AIO job queue associated with a receive
buffer. It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.

An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet). In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data. I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything. This approach is more
straightforward for now.

Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31976


# 883761f0 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Remove NOFREE from the socket zone

This flag was added during the transition away from the legacy zone
allocator, commit c897b81311792ccf6a93feff2a405e2ae53f664e. The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets. In particular, we use reference
counting to keep sockets live.

One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0. This socket is still effectively owned
by the listening socket. Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets. For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.

Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure. No functional change intended.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31975


# 6b288408 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Add assertions around naked refcount decrements

Sockets in a listen queue hold a reference to the parent listening
socket. Several code paths release this reference manually when moving
a child socket out of the queue.

Replace comments about the expected post-decrement refcount value with
assertions. Use refcount_load() instead of a plain load. No functional
change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31974


# dfcef877 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Fix a use-after-free in soclose()

After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed. Instead,
directly test whether the list of unaccepted sockets is empty.

Fixes: f4bb1869ddd2 ("Consistently use the SOLISTENING() macro")
Pointy hat: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31973


# fa0463c3 14-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: De-duplicate SBLOCKWAIT() definitions

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 141fe2dc 10-Sep-2021 Mark Johnston <markj@FreeBSD.org>

aio: Interlock with listen(2)

soo_aio_queue() did not handle the possibility that the provided socket
is a listening socket. Up until recently, to fix this one would have to
acquire the socket lock first and check, since the socket buffer locks
were destroyed by listen(2).

Now that the socket buffer locks belong to the socket, simply check
SOLISTENING(so) after acquiring them, and make listen(2) return an error
if any AIO jobs are enqueued on the socket.

Add a couple of simple regression test cases.

Note that this fixes things only for the default AIO implementation;
cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which
requires its own solution.

Reported by: syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com
Reported by: syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com
Reported by: syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com
Reported by: syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31901


# 523d58aa 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Remove unneeded SOLISTENING checks

Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks. No functional change intended.

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31660


# bd4a39cc 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Properly interlock when transitioning to a listening socket

Currently, most protocols implement pru_listen with something like the
following:

SOCK_LOCK(so);
error = solisten_proto_check(so);
if (error) {
SOCK_UNLOCK(so);
return (error);
}
solisten_proto(so);
SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called. Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31659


# f94acf52 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Rename sb(un)lock() and interlock with listen(2)

In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().

Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657


# 7045b160 28-Jul-2021 Roy Marples <roy@marples.name>

socket: Implement SO_RERROR

SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652


# a1002174 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros

This makes it easier to change the socket locking protocols. No
functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# f4bb1869 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOLISTENING() macro

Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# aa341db3 25-May-2021 John Baldwin <jhb@FreeBSD.org>

Rename m_unmappedtouio() to m_unmapped_uiomove().

This function doesn't only copy data into a uio but instead is a
variant of uiomove() similar to uiomove_fromphys().

Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30444


# 916c61a5 21-May-2021 Mark Johnston <markj@FreeBSD.org>

Fix handling of errors from pru_send(PRUS_NOTREADY)

PRUS_NOTREADY indicates that the caller has not yet populated the chain
with data, and so it is not ready for transmission. This is used by
sendfile (for async I/O) and KTLS (for encryption). In particular, if
pru_send returns an error, the caller is responsible for freeing the
chain since other implicit references to the data buffers exist.

For async sendfile, it happens that an error will only be returned if
the connection was dropped, in which case tcp_usr_ready() will handle
freeing the chain. But since KTLS can be used in conjunction with the
regular socket I/O system calls, many more error cases - which do not
result in the connection being dropped - are reachable. In these cases,
KTLS was effectively assuming success.

So:
- Change sosend_generic() to free the mbuf chain if
pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the
chain at that point.
- Similarly, in vn_sendfile() change the !async I/O && KTLS case to free
the chain.
- If async I/O is still outstanding when pru_send fails in
vn_sendfile(), set an error in the sfio structure so that the
connection is aborted and the mbuf chain is freed.

Reviewed by: gallatin, tuexen
Discussed with: jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30349


# b295c5dd 18-May-2021 Lv Yunlong <lylgood@foxmail.com>

socket: Release cred reference later in sodealloc()

We dereference so->so_cred to update the per-uid socket buffer
accounting, so the crfree() call must be deferred until after that
point.

PR: 255869
MFC after: 1 week


# d8acd268 12-May-2021 Mark Johnston <markj@FreeBSD.org>

Fix mbuf leaks in various pru_send implementations

The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30151


# 3aaaa2ef 28-Apr-2021 Thomas Munro <tmunro@FreeBSD.org>

poll(2): Add POLLRDHUP.

Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if
requested. Triggered when the remote peer shuts down writing or closes
its end.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D29757


# 27457983 07-Apr-2021 Mark Johnston <markj@FreeBSD.org>

capsicum: Limit socket operations in capability mode

Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423


# f187d6df 15-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

base: remove if_wg(4) and associated utilities, manpage

After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.


# 74ae3f3e 14-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

if_wg: import latest fixup work from the wireguard-freebsd project

This is the culmination of about a week of work from three developers to
fix a number of functional and security issues. This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
the interface's home vnet so that it can act as the sole network
connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
complete. It is additionally supported by the upstream
wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib. iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations. This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after: 1 month (maybe)


# 504ebd61 20-Jan-2021 Kyle Evans <kevans@FreeBSD.org>

kern: sonewconn: set so_options before pru_attach()

Protocol attachment has historically been able to observe and modify
so->so_options as needed, and it still can for newly created sockets.
779f106aa169 moved this to after pru_attach() when we re-acquire the
lock on the listening socket.

Restore the historical behavior so that pru_attach implementations can
consistently use it. Note that some pru_attach() do currently rely on
this, though that may change in the future. D28265 contains a change to
remove the use in TCP and IB/SDP bits, as resetting the requested linger
time on incoming connections seems questionable at best.

This does move the assignment out from under the head's listen lock, but
glebius notes that head won't be going away and applications cannot
assume any specific ordering with a race between a connection coming in
and the application changing socket options anyways.

Discussed-with: glebius
MFC-after: 1 week


# 924d1c9a 08-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.


# 4a01b854 07-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.


# 34af05ea 03-Dec-2020 Kyle Evans <kevans@FreeBSD.org>

kern: soclose: don't sleep on SO_LINGER w/ timeout=0

This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.

This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.

Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by: glebius, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27407


# e90afaa0 08-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: save space by using only one func pointer for assertions


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 102829aa 19-Aug-2020 Rick Macklem <rmacklem@FreeBSD.org>

Add the MSG_TLSAPPDATA flag to indicate "return ENXIO" for non-application TLS
data records.

The kernel RPC cannot process non-application data records when
using TLS. It must to an upcall to a userspace daemon that will
call SSL_read() to process them.

This patch adds a new flag called MSG_TLSAPPDATA that the kernel
RPC can use to tell sorecieve() to return ENXIO instead of a non-application
data record, when that is what is at the top of the receive queue.
I put the code in #ifdef KERN_TLS/#endif, although it will build without
that, so that it is recognized as only useful when KERN_TLS is enabled.
The alternative to doing this is to have the kernel RPC re-queue the
non-application data message after receiving it, but that seems more
complicated and might introduce message ordering issues when there
are multiple non-application data records one after another.

I do not know what, if any, changes will be required to support TLS1.3.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D25923


# 0f70a148 29-Jul-2020 John Baldwin <jhb@FreeBSD.org>

Properly handle a closed TLS socket with pending receive data.

If the remote end closes a TLS socket and the socket buffer still
contains not-yet-decrypted TLS records but no decrypted TLS records,
soreceive needs to block or fail with EWOULDBLOCK. Previously it was
trying to return data and dereferencing a NULL pointer.

Reviewed by: np
Sponsored by: Chelsio
Differential Revision: https://reviews.freebsd.org/D25838


# 3c0e5685 23-Jul-2020 John Baldwin <jhb@FreeBSD.org>

Add support for KTLS RX via software decryption.

Allow TLS records to be decrypted in the kernel after being received
by a NIC. At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse. Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the
tail of a socket buffer and the KTLS layer decrypts those records
before returning them to userland applications. However, there is an
important difference:

- In the transmit case, the socket buffer is always a single "record"
holding a chain of mbufs. Not-yet-encrypted mbufs are marked not
ready (M_NOTREADY) and released to protocols for transmit by marking
mbufs ready once their data is encrypted.

- In the receive case, incoming (encrypted) data appended to the
socket buffer is still a single stream of data from the protocol,
but decrypted TLS records are stored as separate records in the
socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seemed to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Also, such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

- Socket buffers now contain an additional chain of mbufs (sb_mtls,
sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
the protocol layer. These mbufs are still marked M_NOTREADY, but
soreceive*() generally don't know about them (except that they will
block waiting for data to be decrypted for a blocking read).

- Each time a new mbuf is appended to this TLS mbuf chain, the socket
buffer peeks at the TLS record header at the head of the chain to
determine the encrypted record's length. If enough data is queued
for the TLS record, the socket is placed on a per-CPU TLS workqueue
(reusing the existing KTLS workqueues and worker threads).

- The worker thread loops over the TLS mbuf chain decrypting records
until it runs out of data. Each record is detached from the TLS
mbuf chain while it is being decrypted to keep the mbufs "pinned".
However, a new sb_dtlscc field tracks the character count of the
detached record and sbcut()/sbdrop() is updated to account for the
detached record. After the record is decrypted, the worker thread
first checks to see if sbcut() dropped the record. If so, it is
freed (can happen when a socket is closed with pending data).
Otherwise, the header and trailer are stripped from the original
mbufs, a control message is created holding the decrypted TLS
header, and the decrypted TLS record is appended to the "normal"
socket buffer chain.

(Side note: the SBCHECK() infrastucture was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)

Tested by: rmacklem (various versions)
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24628


# 95033af9 18-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Add the SCTP_SUPPORT kernel option.

This is in preparation for enabling a loadable SCTP stack. Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 2684603c 28-May-2020 John Baldwin <jhb@FreeBSD.org>

Permit SO_NO_DDP and SO_NO_OFFLOAD to be read via getsockopt(2).

MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24627


# 469f2e9e 27-May-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix sosend() for the case where mbufs are passed in while doing ktls.

For kernel tls, sosend() needs to call ktls_frame() on the mbuf list
to be sent. Without this patch, this was only done when sosend()'s
arguments used a uio_iov and not when an mbuf list is passed in.
At this time, sosend() is never called with an mbuf list argument when
kernel tls is in use, but will be once nfs-over-tls has been incorporated
into head.

Reviewed by: gallatin, glebius
Differential Revision: https://reviews.freebsd.org/D24674


# 0532a7a2 14-May-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix r361037.

Reorder flag manipulations and use barrier to ensure that the program
order is followed by compiler and CPU, for unlocked reader of so_state.

In collaboration with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24842


# 39845728 14-May-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix spurious ENOTCONN from closed unix domain socket other' side.

Sometimes, when doing read(2) over unix domain socket, for which the
other side socket was closed, read(2) returns -1/ENOTCONN instead of
EOF AKA zero-size read. This is because soreceive_generic() does not
lock socket when testing the so_state SS_ISCONNECTED|SS_ISCONNECTING
flags. It could end up that we do not observe so->so_rcv.sb_state bit
SBS_CANTRCVMORE, and then miss SS_ flags.

Change the test to check that the socket was never connected before
returning ENOTCONN, by adding all state bits for connected.

Reported and tested by: pho
In collaboration with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24819


# 6edfd179 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.1: mechanically rename M_NOMAP to M_EXTPG

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 03066893 27-Apr-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix sosend_generic() so that it can handle a list of ext_pgs mbufs.

Without this patch, sosend_generic() will try to use top->m_pkthdr.len,
assuming that the first mbuf has a pkthdr.
When a list of ext_pgs mbufs is passed in, the first mbuf is not a
pkthdr and cannot be post-r359919. As such, the value of top->m_pkthdr.len
is bogus (0 for my testing).
This patch fixes sosend_generic() to handle this case, calculating the
total length via m_length() for this case.

There is currently nothing that hands a list of ext_pgs mbufs to
sosend_generic(), but the nfs-over-tls kernel RPC code in
projects/nfs-over-tls will do that and was used to test this patch.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24568


# f1f93475 27-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Initial support for kernel offload of TLS receive.

- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
authentication algorithms and keys as well as the initial sequence
number.

- When reading from a socket using KTLS receive, applications must use
recvmsg(). Each successful call to recvmsg() will return a single
TLS record. A new TCP control message, TLS_GET_RECORD, will contain
the TLS record header of the decrypted record. The regular message
buffer passed to recvmsg() will receive the decrypted payload. This
is similar to the interface used by Linux's KTLS RX except that
Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
1.2, though I believe we will be able to reuse the same interface
and structures for 1.3.


# fb401f1b 14-Apr-2020 Jonathan T. Looney <jtl@FreeBSD.org>

Make sonewconn() overflow messages have per-socket rate-limits and values.

sonewconn() emits debug-level messages when a listen socket's queue
overflows. Currently, sonewconn() tracks overflows on a global basis. It
will only log one message every 60 seconds, regardless of how many sockets
experience overflows. And, when it next logs at the end of the 60 seconds,
it records a single message referencing a single PCB with the total number
of overflows across all sockets.

This commit changes to per-socket overflow tracking. The code will now
log one message every 60 seconds per socket. And, the code will provide
per-socket queue length and overflow counts. It also provides a way to
change the period between log messages using a sysctl.

Reviewed by: jhb (previous version), bcr (manpages)
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24316


# f6ab9795 14-Apr-2020 Jonathan T. Looney <jtl@FreeBSD.org>

Print more detail as part of the sonewconn() overflow message.

When a socket's listen queue overflows, sonewconn() emits a debug-level
log message. These messages are sometimes useful to systems administrators
in highlighting a process which is not keeping up with its listen queue.

This commit attempts to enhance the usefulness of this message by printing
more details about the socket's address. If all else fails, it will at
least print the domain name of the socket.

Reviewed by: bz, jhb, kbowling
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24272


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# f85e1a80 25-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ktls_frame() never fail. Caller must supply correct mbufs.
This makes sendfile code a bit simplier.


# 975b8f84 09-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Cleanup unneeded includes that crept in with r353292.


# 9e14430d 08-Oct-2019 John Baldwin <jhb@FreeBSD.org>

Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.

This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 75697b16 18-Aug-2019 Andrey V. Elsukov <ae@FreeBSD.org>

Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose().

PR: 239893
MFC after: 1 week


# a85b7f12 14-Jul-2019 Michael Tuexen <tuexen@FreeBSD.org>

Improve the input validation for l_linger.
When using the SOL_SOCKET level socket option SO_LINGER, the structure
struct linger is used as the option value. The component l_linger is of
type int, but internally copied to the field so_linger of the structure
struct socket. The type of so_linger is short, but it is assumed to be
non-negative and the value is used to compute ticks to be stored in a
variable of type int.

Therefore, perform input validation on l_linger similar to the one
performed by NetBSD and OpenBSD.

Thanks to syzkaller for making me aware of this issue.

Thanks to markj@ for pointing out that a similar check should be added
to so_linger_set().

Reviewed by: markj@
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20948


# 6d958292 02-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Fix handling of errors from sblock() in soreceive_stream().

Previously we would attempt to unlock the socket buffer despite having
failed to lock it. Simply return an error instead: no resources need
to be released at this point, and doing so is consistent with
soreceive_generic().

PR: 238789
Submitted by: Greg Becker <greg@codeconcepts.com>
MFC after: 1 week


# 82334850 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 1db2626a 27-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Fix comment in sofree() to reference sbdestroy().

r160875 added sbdestroy() as a wrapper around sbrelease_internal to be
called from sofree(), yet the comment added in the same revision to
sofree() still mentions sbrelease_internal().

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20488


# 3fe00ac4 03-Mar-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Remove bogus assert that I added in r319722. It is a legitimate case
to call soabort() on a newborn socket created by sonewconn() in case
if further setup of PCB failed. Code in sofree() handles such socket
correctly.

Submitted by: jtl, rrs
MFC after: 3 weeks


# 7dff7eda 13-Jan-2019 Jason A. Harmening <jah@FreeBSD.org>

Handle SIGIO for listening sockets

r319722 separated struct socket and parts of the socket I/O path into
listening-socket-specific and dataflow-socket-specific pieces. Listening
socket connection notifications are now handled by solisten_wakeup() instead
of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the
owning process.

PR: 234258
Reported by: Kenneth Adelman
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18664


# bcc3cec4 09-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Simplify sosetopt() so that function has single return point. No
functional change.


# 2f2ddd68 04-Jan-2019 Mark Johnston <markj@FreeBSD.org>

Support MSG_DONTWAIT in send*(2).

As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead. Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: tuexen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18728


# 79db6fe7 22-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug some networking sysctl leaks.

Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland. Fix a number of such handlers.

Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: tuexen
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18301


# e77f0bdc 18-Oct-2018 Jonathan T. Looney <jtl@FreeBSD.org>

r334853 added a "socket destructor" callback. However, as implemented, it
was really a "socket close" callback.

Update the socket destructor functionality to run when a socket is
destroyed (rather than when it is closed). The original submitter has
confirmed that this change satisfies the intended use case.

Suggested by: rwatson
Submitted by: Michio Honda <micchie at sfc.wide.ad.jp>
Tested by: Michio Honda <micchie at sfc.wide.ad.jp>
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17590


# ad7eb8ca 03-Oct-2018 Gleb Smirnoff <glebius@FreeBSD.org>

In PR 227259, a user is reporting that they have code which is using
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.

It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.

This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.

PR: 227259
Obtained from: jtl
Approved by: re (kib)
Differential Revision: D15019


# 6b01d4d4 21-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Add SOL_SOCKET level socket option with name SO_DOMAIN to get
the domain of a socket.

This is helpful when testing and Solaris and Linux have the same
socket option using the same name.

Reviewed by: bcr@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16791


# 3a20f06a 10-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Use uintptr_t alone when assigning to kvaddr_t variables.

Suggested by: jhb


# 7524b4c1 06-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Correct breakage on 32-bit platforms from r335979.


# f38b68ae 05-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
idenitifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR: 228301 (exp-run by antoine)
Reviewed by: jtl, kib, rwatson (various versions)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15386


# 0ea9d937 11-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

limit change to fixing controlp handling pending review


# c34bf300 11-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

soreceive_stream: correctly handle edge cases

- non NULL controlp is not an error, returning EINVAL
would cause X forwarding to fail

- MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still
want to handle them - punt to soreceive_generic


# 1fbe13cf 08-Jun-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add a socket destructor callback. This allows kernel providers to set
callbacks to perform additional cleanup actions at the time a socket is
closed.

Michio Honda presented a use for this at BSDCan 2018.
(See https://www.bsdcan.org/2018/schedule/events/965.en.html .)

Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version)
Reviewed by: lstewart (previous version)
Differential Revision: https://reviews.freebsd.org/D15706


# 1a43cff9 06-Jun-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# 7875017c 24-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks


# 7b7796ee 23-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 584ab65a 14-Sep-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Fix locking in soisconnected().

When a newborn socket moves from incomplete queue to complete
one, we need to obtain the listening socket lock after the child,
which is a wrong order. The old code did that in potentially
endless loop of mtx_trylock(). The new one does only one attempt
of mtx_trylock(), and in case of failure references listening
socket, unlocks child and locks everything in right order. In
case if listening socket shuts down during that, just bail out.

Reported & tested by: Jason Eggleston <jeggleston llnw.com>
Reported & tested by: Jason Wolfe <jason llnw.com>


# 555b3e2f 24-Aug-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Third take on the r319685 and r320480. Actually allow for call soisconnected()
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.

This fixes a situation when a socket on accept queue is reset before being
accepted.

Reported by: Jason Eggleston <jeggleston llnw.com>


# 27d8bea8 21-Jul-2017 Michael Tuexen <tuexen@FreeBSD.org>

Fix getsockopt() for listening sockets when using SO_SNDBUF, SO_RCVBUF,
SO_SNDLOWAT, SO_RCVLOWAT. Since r31972 it only worked for non-listening
sockets.

Sponsored by: Netflix, Inc.


# fe715b80 04-Jul-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

After r319722 two fields were left uninitialized when transforming a
socket structure into a listening socket. This resulted in an invalid
instruction fault for all 32-bit platforms.

When INVARIANTS is set the union where the two uninitialized fields
reside gets properly zeroed. This patch ensures the two uninitialized
fields are zeroed when INVARIANTS is undefined.

For 64-bit platforms this issue was not visible because so->sol_upcall
which is uninitialized overlaps with so->so_rcv.sb_state which is
already zero during soalloc();

For 32-bit platforms this issue was visible and resulted in an invalid
instruction fault, because so->sol_upcall overlaps with
so->so_rcv.sb_sel which is always initialized to a valid data pointer
during soalloc().

Verifying the offset locations mentioned above are identical is left
as an exercise to the reader.

PR: 220452
PR: 220358
Reviewed by: ae (network), gallatin
Differential Revision: https://reviews.freebsd.org/D11475
Sponsored by: Mellanox Technologies


# 64290bef 24-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Provide sbsetopt() that handles socket buffer related socket options.
It distinguishes between data flow sockets and listening sockets, and
in case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers, they only carry values to be inherited
by their children.


# 2b8e036b 15-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Plug read(2) and write(2) on listening sockets.


# 779f106a 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770


# b3244df7 06-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Provide typedef for socket upcall function.
While here change so_gen_t type to modern uint64_t.


# b94f68dc 06-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Remove a piece of dead code.


# 971af2a3 02-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Rename accept filter getopt/setopt functions, so that they are prefixed
with module name and match other functions in the module. There is no
functional change.


# 14315212 25-Apr-2017 Patrick Kelsey <pkelsey@FreeBSD.org>

Remove unnecessary check for NULL mbuf in soreceive_generic().

This check has been redundant since it was introduced in r162554.

Reviewed by: emaste, glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10322


# 63649db0 14-Apr-2017 Maxim Sobolev <sobomax@FreeBSD.org>

Restore ability to shutdown DGRAM sockets, still forcing ENOTCONN to be returned
by the shutdown(2) system call. This ability has been lost as part of the svn
revision 285910.

Reviewed by: ed, rwatson, glebius, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10351


# 4b481ba0 01-Feb-2017 Hartmut Brandt <harti@FreeBSD.org>

Merge filt_soread and filt_solisten and decide what to do when checking
for EVFILT_READ at the point of the check not when the event is registers.
This fixes a problem with asio when accepting a connection.

Reviewed by: kib@, Scott Mitchell


# f3e7afe2 18-Jan-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months


# 339efd75 16-Jan-2017 Maxim Sobolev <sobomax@FreeBSD.org>

Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171


# 7d03ff1f 16-Jan-2017 Hiren Panchasara <hiren@FreeBSD.org>

Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by: bz, gnn (previous version)
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9150


# 14da48cb 06-Jan-2017 John Baldwin <jhb@FreeBSD.org>

Set MORETOCOME for AIO write requests on a socket.

Add a MSG_MOREOTOCOME message flag. When this flag is set, sosend*
set PRUS_MOREOTOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.

Reviewed by: adrian
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D8955


# dd7d4f19 22-Nov-2016 Ruslan Bukin <br@FreeBSD.org>

Revert r306186 ("Adjust the sopt_val pointer on bigendian systems").

This logic doesn't work with bigger sopt_valsize (e.g. when ipfw
passing 2048 bytes rule).

Reported by: adrian
Sponsored by: DARPA, AFRL


# 30f3bfe5 21-Sep-2016 Ruslan Bukin <br@FreeBSD.org>

Adjust the sopt_val pointer on bigendian systems (e.g. MIPS64EB).

sooptcopyin() checks if size of data provided by user is <= than we can
accept, else it strips down the size. On bigendian platforms we have to
move pointer as well so we copy the actual data.

Reviewed by: gnn
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D7980


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# c3bef61e 15-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.

Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D7878


# 306e53bc 22-May-2016 Baptiste Daroussin <bapt@FreeBSD.org>

Fix typo introduced by me (not the submitter) when fixing typos


# 2fd642c8 22-May-2016 Baptiste Daroussin <bapt@FreeBSD.org>

Fix typos in the comments

Submitted by: cipherwraith666@gmail.com (via github)


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# 8722384b 29-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Introduce a new protocol hook pru_aio_queue.

This allows a protocol to claim individual AIO requests instead of using
the default socket AIO handling.

Sponsored by: Chelsio Communications


# ab771750 21-Mar-2016 Maxim Konovalov <maxim@FreeBSD.org>

o "avaliable" -> "available".

PR: 208141
Submitted by: Tyler Littlefield


# f3215338 01-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order. Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels. The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon. This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system. To protect against this, AIO
requests are only permitted for known "safe" files by default. AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default. aio_mlock() is also enabled by default.

Reviewed by: cem, jilles
Discussed with: kib (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5289


# 7325dfbb 01-Feb-2016 Alfred Perlstein <alfred@FreeBSD.org>

Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks


# b114aa79 27-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Make shutdown() return ENOTCONN as required by POSIX, part deux.

Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.

Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.

I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same.

Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:

https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26

Reviewers: glebius, rwatson, #manpages, gnn, #network

Reviewed By: gnn, #network

Subscribers: bms, mjg, imp

Differential Revision: https://reviews.freebsd.org/D3039


# 1a7c14ae 24-Jul-2015 Xin LI <delphij@FreeBSD.org>

Fix a typo in comment.

Submitted by: Yanhui Shen via twitter
MFC after: 3 days


# 0c40f353 13-Jul-2015 Conrad Meyer <cem@FreeBSD.org>

Fix cleanup race between unp_dispose and unp_gc

unp_dispose and unp_gc could race to teardown the same mbuf chains, which
can lead to dereferencing freed filedesc pointers.

This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS
as invalid/freed. The flag is protected by UNP_LIST_LOCK.

To serialize against unp_gc, unp_dispose needs the socket object. Change the
dom_dispose() KPI to take a socket object instead of an mbuf chain directly.

PR: 194264
Differential Revision: https://reviews.freebsd.org/D3044
Reviewed by: mjg (earlier version)
Approved by: markj (mentor)
Obtained from: mjg
MFC after: 1 month
Sponsored by: EMC / Isilon Storage Division


# e9b70483 23-Feb-2015 Andrey V. Elsukov <ae@FreeBSD.org>

soreceive_generic() still has similar KASSERT(), therefore instead of
remove KASSERT(), change it to check mbuf isn't NULL.

Suggested by: kib
MFC after: 1 week


# f21684bc 23-Feb-2015 Andrey V. Elsukov <ae@FreeBSD.org>

In some cases soreceive_dgram() can return no data, but has control
message. This can happen when application is sending packets too big
for the path MTU and recvmsg() will return zero (indicating no data)
but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU.
Remove KASSERT() which does NULL pointer dereference in such case.
Also call m_freem() only when m isn't NULL.

PR: 197882
MFC after: 1 week
Sponsored by: Yandex LLC


# a76d4388 14-Feb-2015 Davide Italiano <davide@FreeBSD.org>

Don't access sockbuf fields directly, use accessor functions instead.
It is safe to move the call to socantsendmore_locked() after
sbdrop_locked() as long as we hold the sockbuf lock across the two
calls.

CR: D1805
Reviewed by: adrian, kmacy, julian, rwatson


# e834a840 20-Dec-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r274494, r274712, r275955 and provide extra comments explaining
why there could appear a zero-sized mbufs in socket buffers.

A proper fix would be to divorce record socket buffers and stream
socket buffers, and divorce pru_send that accepts normal data from
pru_send that accepts control data.


# 5ad25ceb 15-Dec-2014 John Baldwin <jhb@FreeBSD.org>

Check for SS_NBIO in so->so_state instead of sb->sb_flags in
soreceive_stream().

Differential Revision: https://reviews.freebsd.org/D1299
Reviewed by: bz, gnn
MFC after: 1 week


# 651e4e6a 30-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 0f9d0a73 29-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:

o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
freed.

o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for ""claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API shouldn't already access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.

A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 67af272b 19-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Do not allocate zero-length mbuf in sosend_generic().

Found by: pho
Sponsored by: Nginx, Inc.


# 6bf6b25e 14-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:
Use sbcut_locked() instead of manually editing a sockbuf.

Sponsored by: Nginx, Inc.


# cfa6009e 12-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 71426637 08-Sep-2014 Hiroki Sato <hrs@FreeBSD.org>

- Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() around
the function calls.
- Fix a memory leak and stats in the case that hhook_run_socket() fails
in soalloc().

PR: 193265


# 9e739a5a 06-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix for r271182.

Submitted by: mjg
Pointy hat to: me, submitter and everyone who urged me to commit


# d9257d8b 05-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Set vnet context before accessing V_socket_hhh[].

Submitted by: "Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>


# e86447ca 26-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove socket file operations declaration from sys/file.h.
- Make them static in sys_socket.c.
- Provide generic invfo_truncate() instead of soo_truncate().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# ed063112 21-Aug-2014 Hiroki Sato <hrs@FreeBSD.org>

Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, and
separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT()
handler.

Discussed with: marcel


# 4ec73712 18-Aug-2014 Marcel Moolenaar <marcel@FreeBSD.org>

For vendors like Juniper, extensibility for sockets is important. A
good example is socket options that aren't necessarily generic. To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets. These are:
o soalloc() and sodealloc()
o Get and set socket options
o Socket related kevent filters.

One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit). To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.

Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Obtained from: Juniper Networks, Inc.


# 4295aa92 03-Aug-2014 Davide Italiano <davide@FreeBSD.org>

Fix an overflow in getsockopt(). optval isn't big enough to hold
sbintime_t.
Re-introduce r255030 behaviour capping socket timeouts to INT_32
if they're too large.

CR: https://phabric.freebsd.org/D433
Reported by: demon
Reviewed by: bde [1], jhb [2]
MFC after: 2 weeks


# 1e0a021e 26-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

The accept filter code is not specific to the FreeBSD IPv4 network stack,
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.

Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.

Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.

In order to support existing accept filter sysctls, the net.inet.accf node
has been added netinet/in_proto.c.

Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.


# d978bbea 16-Jan-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Simplify wait/nowait code, eventually killing last remnant of
historical mbuf(9) allocator flag.

Sponsored by: Nginx, Inc.


# 16ef0fa8 08-Nov-2013 Hiren Panchasara <hiren@FreeBSD.org>

Fix typo in a comment.


# f7a3a2a5 31-Oct-2013 Maksim Yevmenkin <emax@FreeBSD.org>

Rate limit (to once per minute) "Listen queue overflow" message in
sonewconn().

Reviewed by: scottl, lstewart
Obtained from: Netflix, Inc
MFC after: 2 weeks


# 3846a822 16-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)


# 7729cbf1 01-Sep-2013 Davide Italiano <davide@FreeBSD.org>

Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by: Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by: jhb (earlier version)


# e289e9f2 29-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Don't return an error for socket timeouts that are too large. Just
cap them to INT_MAX ticks instead.

PR: kern/181416 (r254699 really)
Requested by: bde
MFC after: 2 weeks


# e77c507d 23-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Use tvtohz() to convert a socket buffer timeout to a tick value rather
than using a home-rolled version. The home-rolled version could result
in shorter-than-requested sleeps.

Reported by: Vitja Makarov <vitja.makarov@gmail.com>
MFC after: 2 weeks


# 6753da13 08-May-2013 Andre Oppermann <andre@FreeBSD.org>

When the accept queue is full print the number of already pending
new connections instead of by how many we're over the limit, which
is always 1.

Noticed by: jmallet
MFC after: 1 week


# f89d4c3a 06-May-2013 Andre Oppermann <andre@FreeBSD.org>

Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


# cd31b6dd 30-Apr-2013 Jilles Tjoelker <jilles@FreeBSD.org>

socket: Make shutdown() wake up a blocked accept().

A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after: 2 weeks


# d58a9653 09-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Fix the build.


# e8b3186b 09-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.


# a307eb26 29-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

When soreceive_generic() hands off an mbuf from buffer,
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.

The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.

Sponsored by: Nginx, Inc


# c2e3c52e 19-Mar-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.

This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by: kib


# fbb34710 11-Mar-2013 Michael Tuexen <tuexen@FreeBSD.org>

Return an error if sctp_peeloff() fails because a socket can't be allocated.

MFC after: 3 days


# 7493f24e 02-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des


# 6e0b6746 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 94b0ae5d 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Make socket_zone static - it is used only in this file.
- Update maxsockets on uma_zone_set_max().

Obtained from: WHEEL Systems


# 68412f41 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Style cleanups.


# b08d12d9 06-Dec-2012 Kevin Lo <kevlo@FreeBSD.org>

- according to POSIX, make socket(2) return EAFNOSUPPORT rather than
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
appropriate.

Reviewed by: glebius


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 358c7f47 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Fix r243627 by testing against the head socket instead of the socket
just created.

MFC after: 1 week
X-MFC-with: r243627


# ead46972 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower. The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used
these large clusters will end up only on the inbound path. They are
not used on outbound, there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
512MB limit for mbufs
419,430 mbufs
65,536 2K mbuf clusters
32,768 4K mbuf clusters
9,709 9K mbuf clusters
5,461 16K mbuf clusters

16GB RAM:
8GB limit for mbufs
33,554,432 mbufs
1,048,576 2K mbuf clusters
524,288 4K mbuf clusters
155,344 9K mbuf clusters
87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after: 1 month


# 2c3142c8 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Fix a race on listen socket teardown where while draining the
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.

The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.

Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by: His team.
MFC after: 1 week


# e8ad36ab 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop. Instead append the new mbuf chain to the existing one.

Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.

For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length. Additionally
don't depend on 'n' being being available which isn't true in the
case of MSG_PEEK.

Fix the MSG_WAITALL case by comparing against sb_hiwat. Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.

Submitted by: trociny (except for the change in last paragraph)


# fdd1b7f5 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Add logging for socket attach failures in sonewconn() during accept(2).
Include the pointer to the PCB so it can be attributed to a particular
application by corresponding it to "netstat -A" output.

MFC after: 2 weeks


# e37e60c3 23-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Replace the ill-named ZERO_COPY_SOCKET kernel option with two
more appropriate named kernel options for the very distinct
send and receive path.

"options SOCKET_SEND_COW" enables VM page copy-on-write based
sending of data on an outbound socket.

NB: The COW based send mechanism is not safe and may result
in kernel crashes.

"options SOCKET_RECV_PFLIP" enables VM kernel/userspace page
flipping for special disposable pages attached as external
storage to mbufs.

Only the naming of the kernel options is changed and their
corresponding #ifdef sections are adjusted. No functionality
is added or removed.

Discussed with: alc (mechanism and limitations of send side COW)


# dc00208e 20-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Grammar fixes to r241781.

Submitted by: alc


# 2bdf61ca 19-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Hide the unfortunate named sysctl kern.ipc.somaxconn from sysctl -a
output and replace it with a new visible sysctl kern.ipc.acceptqueue
of the same functionality. It specifies the maximum length of the
accept queue on a listen socket.

The old kern.ipc.somaxconn remains available for reading and writing
for compatibility reasons so that existing programs, scripts and
configurations continue to work. There no plans to ever remove the
orginal and now hidden kern.ipc.somaxconn.


# 1490de00 20-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Tidy up somaxconn (accept queue limit) and related functions
and move it together into one place.


# 4b62fe5b 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Move socket UMA zone initialization functionality together into
one place.


# cf8e6069 19-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c
into one place next to its other related functions to avoid confusion.


# d10733a8 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Remove unnecessary includes from sosend_copyin() and fix
a couple of style issues.


# 1d147759 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within
zero copy specialized sosend_copyin() helper function.


# 48b5c741 02-Oct-2012 Garrett Wollman <wollman@FreeBSD.org>

Fix spelling of the function name in two assertion messages.


# bb9f214f 02-Sep-2012 Mikolaj Golub <trociny@FreeBSD.org>

In soreceive_generic() remove the optimization for the case when
MSG_WAITALL is set, and it is possible to do the entire receive
operation at once if we block (resid <= hiwat). Actually it might make
the recv(2) with MSG_WAITALL flag get stuck when there is enough space
in the receiver buffer to satisfy the request but not enough to open
the window closed previously due to the buffer being full.

The issue can be reproduced using the following scenario:

On the sender side do 2 send(2) requests:

1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
2) data of size equal to SOBUF_SIZE.

On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set:

1) recv() data of SOBUF_SIZE / 10 size;
2) recv() data of SOBUF_SIZE size;

We totally fill the receiver buffer with one SOBUF_SIZE/10 size request
and partial SOBUF_SIZE request. When the first request is processed we
get SOBUF_SIZE/10 free space. It is just enough to receive the rest of
bytes for the second request, and soreceive_generic() blocks in the
part that is a subject of this change waiting for the rest. But the
window was closed when the buffer was filled and to avoid silly window
syndrome it opens only when available space is larger than sb_hiwat/4
or maxseg. So it is stuck and pending data is only sent via TCP window
probes.

Discussed with: kib (long ago)
MFC after: 2 weeks


# 2ad099fc 02-Sep-2012 Mikolaj Golub <trociny@FreeBSD.org>

In soreceive_generic() when checking if the type of mbuf has changed
check it for MT_CONTROL type too, otherwise the assertion
"m->m_type == MT_DATA" below may be triggered by the following scenario:

- the sender sends some data (MT_DATA) and then a file descriptor
(MT_CONTROL);
- the receiver calls recv(2) with a MSG_WAITALL asking for data larger
than the receive buffer (uio_resid > hiwat).

MFC after: 2 week


# e71a7957 03-Jul-2012 Mikolaj Golub <trociny@FreeBSD.org>

Fix KASSERT message.

MFC after: 3 days


# 60a30588 03-Apr-2012 Navdeep Parhar <np@FreeBSD.org>

- Remove redundant call to pr_ctloutput from code that handles SO_SETFIB.
- Add a check for errors during copyin while here.

Reviewed by: julian, bz
MFC after: 2 weeks


# 747d2fa1 26-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the
socket protocol number. This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.

Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: kern/162352
Discussed with: bz
Reviewed by: glebius
MFC after: 2 weeks


# 9493639e 26-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove apparently redundand checks for socket so_proto being non-NULL
from sosetopt() and sogetopt(). No exposed sockets may have so_proto
invalid.

Discussed with: bz, rwatson
Reviewed by: glebius
MFC after: 2 weeks


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# cf8b8325 04-Feb-2012 Hiroki Sato <hrs@FreeBSD.org>

Fix input validation in SO_SETFIB.

Reviewed by: bz
MFC after: 1 day


# ee799639 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the
FIB number from the process, as set by setfib(2), on socket creation.

Sponsored by: Cisco Systems, Inc.


# ea4d9a14 14-Nov-2011 Robert Millan <rmh@FreeBSD.org>

Remove a few bits of FreeBSD 2.x compatibility code.

Approved by: kib (mentor)


# 6aba400a 25-Aug-2011 Attilio Rao <attilio@FreeBSD.org>

Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks


# 695da99e 08-Jul-2011 Andre Oppermann <andre@FreeBSD.org>

In the experimental soreceive_stream():

o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF
is correctly returned on a remote connection close.
o In the non-blocking socket test compare SS_NBIO against the so->so_state
field instead of the incorrect sb->sb_state field.
o Simplify the ENOTCONN test by removing cases that can't occur.

Submitted by: trociny (with some further tweaks by committer)
Tested by: trociny


# 1c6e7fa7 07-Jul-2011 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP_SORECEIVE_STREAM compile time option. The use of
soreceive_stream() for TCP still has to be enabled with the loader
tuneable net.inet.tcp.soreceive_stream.

Suggested by: trociny and others


# 3204c8e5 29-May-2011 Mikolaj Golub <trociny@FreeBSD.org>

In soreceive_generic(), if MSG_WAITALL is set but the request is
larger than the receive buffer, we have to receive in sections.
When notifying the protocol that some data has been drained the
lock is released for a moment. Returning we block waiting for the
rest of data. There is a race, when data could arrive while the
lock was released and then the connection stalls in sbwait.

Fix this by checking for data before blocking and skip blocking
if there are some.

PR: kern/154504
Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Reviewed by: rwatson
Approved by: kib (co-mentor)
MFC after: 2 weeks


# 1fb51a12 16-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


# f7e6ce6d 12-Feb-2011 Daniel Eischen <deischen@FreeBSD.org>

Allow the SO_SETFIB socket option to select the default (0)
routing table.

Reviewed by: julian


# 0028e524 11-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177255:

Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS.

Change the syntax to match KASSERT() to allow more flexible panic
messages rather than having a printf with hardcoded arguments
before panic.

Adjust the few assertions we have to the new format (and enhance
the output).

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb

MFC after: 2 weeks


# 5c9d0a9a 12-Nov-2010 Luigi Rizzo <luigi@FreeBSD.org>

This commit implements the SO_USER_COOKIE socket option, which lets
you tag a socket with an uint32_t value. The cookie can then be
used by the kernel for various purposes, e.g. setting the skipto
rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has
been implemented; however there is nothing ipfw-specific in its
implementation).

The ipfw-related code that uses the optopn will be committed separately.

This change adds a field to 'struct socket', but the struct is not
part of any driver or userland-visible ABI so the change should be
harmless.

See the discussion at
http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html

Idea and code from Paul Joe, small modifications and manpage
changes by myself.

Submitted by: Paul Joe
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# adb6aa9a 18-Sep-2010 Robert Watson <rwatson@FreeBSD.org>

With reworking of the socket life cycle in 7.x, the need for a "sotryfree()"
was eliminated: all references to sockets are explicitly managed by sorele()
and the protocols. As such, garbage collect sotryfree(), and update
sofree() comments to make the new world order more clear.

MFC after: 3 days
Reported by: Anuranjan Shukla <anshukla at juniper dot net>


# af9ba7d8 07-Aug-2010 Michael Tuexen <tuexen@FreeBSD.org>

Fix a bug where MSG_TRUNC was not returned in all necessary cases for
SOCK_DGRAM socket. MSG_TRUNC was only returned when some mbufs could
not be copied to the application. If some data was left in the last
mbuf, it was correctly discarded, but MSG_TRUNC was not set.

Reviewed by: bz
MFC after: 3 weeks


# 9b47fa59 01-Jun-2010 Robert Watson <rwatson@FreeBSD.org>

Merge r208601 from head to stable/8:

When close() is called on a connected socket pair, SO_ISCONNECTED might be
set but be cleared before the call to sodisconnect(). In this case,
ENOTCONN is returned: suppress this error rather than returning it to
userspace so that close() doesn't report an error improperly.

PR: kern/144061
Reported by: Matt Reimer <mreimer at vpop.net>,
Nikolay Denev <ndenev at gmail.com>,
Mikolaj Golub <to.my.trociny at gmail.com>

Approved by: re (kib)


# e35973e4 27-May-2010 Robert Watson <rwatson@FreeBSD.org>

When close() is called on a connected socket pair, SO_ISCONNECTED might be
set but be cleared before the call to sodisconnect(). In this case,
ENOTCONN is returned: suppress this error rather than returning it to
userspace so that close() doesn't report an error improperly.

PR: kern/144061
Reported by: Matt Reimer <mreimer at vpop.net>,
Nikolay Denev <ndenev at gmail.com>,
Mikolaj Golub <to.my.trociny at gmail.com>
MFC after: 3 days


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# 67208dfa 27-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r204147:

Set curvnet earlier so that it also covers calls to sodisconnect(), which
before were possibly panicing the system in ULP code in the VIMAGE case.

Submitted by: Igor (igor ispsystem.com)


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 0a68a459 20-Feb-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Set curvnet earlier so that it also covers calls to sodisconnect(), which
before were possibly panicing the system in ULP code in the VIMAGE case.

Submitted by: Igor (igor ispsystem.com)
MFC after: 5 days


# 70785093 14-Dec-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r197720 from head to stable/8:

Don't comment on stream socket handling in sosend_dgram, since that's
not handled.


# afd8e45b 02-Oct-2009 Robert Watson <rwatson@FreeBSD.org>

Don't comment on stream socket handling in sosend_dgram, since that's
not handled.

MFC after: 3 weeks


# 11c99a6d 15-Sep-2009 Andre Oppermann <andre@FreeBSD.org>

-Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by: brooks

Once compiled in make it easily switchable for testers by using a tuneable
net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by: rwatson

MFC after: 2 days


# e76d823b 12-Sep-2009 Robert Watson <rwatson@FreeBSD.org>

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks


# 74d1c492 25-Aug-2009 Jilles Tjoelker <jilles@FreeBSD.org>

Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.

This reverts part of r196460, so that sockets only return POLLHUP if both
directions are closed/error. Fifos get POLLHUP by closing the unused
direction immediately after creating the sockets.

The tools/regression/poll/*poll.c tests now pass except for two other things:
- if POLLHUP is returned, POLLIN is always returned as well instead of only
when there is data left in the buffer to be read
- fifo old/new reader distinction does not work the way POSIX specs it

Reviewed by: kib, bde


# f2159cc7 22-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix the conformance of poll(2) for sockets after r195423 by
returning POLLHUP instead of POLLIN for several cases. Now, the
tools/regression/poll results for FreeBSD are closer to that of the
Solaris and Linux.

Also, improve the POSIX conformance by explicitely clearing POLLOUT
when POLLHUP is reported in pollscan(), making the fix global.

Submitted by: bde
Reviewed by: rwatson
MFC after: 1 week


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 7973fba3 28-Jul-2009 Julian Elischer <julian@FreeBSD.org>

Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by: ambrisko
Approved by: re (kib)
MFC after: 3 days


# 006e9db4 19-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by: bz
Approved by: re (kensmith, kib)


# 7f5dff50 07-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix poll(2) and select(2) for named pipes to return "ready for read"
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.

Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.

PR: kern/94772
Submitted by: bde (original version)
Reviewed by: rwatson, jilles
Approved by: re (kensmith)
MFC after: 6 weeks (might be)


# ef760e6a 22-Jun-2009 Andre Oppermann <andre@FreeBSD.org>

Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
the length of data to be moved into the uio compared to
soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams. Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets. Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4: r164817 //depot/user/andre/soreceive_stream/


# 9ed47d01 15-Jun-2009 Jamie Gritton <jamie@FreeBSD.org>

Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by: zec, julian
Approved by: bz (mentor)


# d8b0556c 10-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# f93bfb23 02-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from: TrustedBSD Project


# 74fb0ba7 01-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# 2114e063 08-May-2009 Marko Zec <zec@FreeBSD.org>

A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by: bz
Approved by: julian (mentor) (an earlier version of the diff)


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# ca04ba64 05-Feb-2009 Jamie Gritton <jamie@FreeBSD.org>

Don't allow creating a socket with a protocol family that the current
jail doesn't support. This involves a new function prison_check_af,
like prison_check_ip[46] but that checks only the family.

With this change, most of the errors generated by jailed sockets
shouldn't ever occur, at least until jails are changeable.

Approved by: bz (mentor)


# fd4f1ebd 04-Feb-2009 Robert Watson <rwatson@FreeBSD.org>

Remove written-to but never read local variable 'offset' from
soreceive_dgram().

Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de>
MFC after: 1 week


# 62938659 10-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.

Reviewed by: rwatson, zec
Sponsored by: The FreeBSD Foundation
MFC after: 4 weeks


# 36b5ba0c 10-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Style changes only. Put the return type on an extra line[1] and
add an empty line at the beginning as we do not have any local
variables.

Submitted by: rwatson [1]
Reviewed by: rwatson
MFC after: 4 weeks


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# b4cf0e62 21-Nov-2008 Konstantin Belousov <kib@FreeBSD.org>

Add sv_flags field to struct sysentvec with intention to provide description
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.

Discussed with: dchagin, imp, jhb, peter


# bc97ba51 19-Nov-2008 Julian Elischer <julian@FreeBSD.org>

Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from: Ironport
MFC after: 3 days


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 5a1760fc 16-Oct-2008 Kip Macy <kmacy@FreeBSD.org>

make sure that SO_NO_DDP and SO_NO_OFFLOAD get passed in correctly

PR: 127360
MFC after: 3 days


# ff601c36 07-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

In soreceive_dgram, when a 0-length buffer is passed into recv(2) and
no data is ready, return 0 rather than blocking or returning EAGAIN.
This is consistent with the behavior of soreceive_generic (soreceive)
in earlier versions of FreeBSD, and restores this behavior for UDP.

Discussed with: jhb, sam
MFC after: 3 days


# ffe72750 07-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

Remove temporary debugging KASSERT's introduced to detect protocols
improperly invoking sosend(), soreceive(), and sopoll() instead of
attach either specialized or _generic() versions of those functions
to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods.

MFC after: 3 days


# 1af1c6cd 01-Oct-2008 John Baldwin <jhb@FreeBSD.org>

Wait until after dropping the receive socket buffer lock to allocate space
to store the socket address stored in the first mbuf in a packet chain.
This reduces contention on the lock and CPU system time in certain UDP
workloads.

Tested by: ps
Reviewed by: rwatson
MFC after: 1 week


# 25edc6dd 01-Oct-2008 Robert Watson <rwatson@FreeBSD.org>

Various cleanups for soreceive_dgram():

- Update or remove comments that were left over from the original
soreceive_generic() implementation. Quite a few were misleading in the
context of the new code.
- Since soreceive_dgram() has a simpler structure, replace several gotos
with a while loop making the invariants more clear.
- In the blocking while loop, don't try to handle cases incompatible with
the loop invariant (since m is always NULL, don't check for and handle
non-NULL).
- Don't drop and re-acquire the socket buffer lock unnecessarily after
sbwait() returns, which may help reduce lock contention (etc).
- Assume PR_ATOMIC since we assert it at the top of the function.

MFC after: 3 days


# c4688866 30-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Update the function name in several assertions in soreceive_dgram().

Approved by: rwatson
MFC after: 3 days


# 26ec197d 02-Sep-2008 Robert Watson <rwatson@FreeBSD.org>

Remove XXXRW in soreceive_dgram that proves unnecessary.

Remove unused orig_resid variable in soreceive_dgram.

Submitted by: alfred
X-MFC with: soreceive_dgram (r180198, r180211)


# dd0e6c38 20-Jul-2008 Kip Macy <kmacy@FreeBSD.org>

Add accessor functions for socket fields.

MFC after: 1 week


# 6992381e 03-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Update copyright date in light of soreceive_dgram(9).


# 5df3e839 02-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# cf71e438 14-Apr-2008 Randall Stewart <rrs@FreeBSD.org>

Add pru_flush routine so a transport can
flush itself during Shutdown

MFC after: 1 week


# ea26d587 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


# 073d8ba4 19-Mar-2008 Maxim Sobolev <sobomax@FreeBSD.org>

Revert previous change - it appears that the limit I was hitting was a
maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsokets), but those would be addressed in a separate commit
if necessary.

Requested by: rwhatson, jeff


# c9370ff4 16-Mar-2008 Maxim Sobolev <sobomax@FreeBSD.org>

Properly set size of the file_zone to match kern.maxfiles parameter.
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.

MFC after: 2 weeks


# 3f0bfccc 03-Feb-2008 Robert Watson <rwatson@FreeBSD.org>

Further clean up sorflush:

- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.

MFC after: 3 weeks


# 265de5bb 31-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 041b706b 04-Jun-2007 David Malone <dwmalone@FreeBSD.org>

Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# d19e16a7 16-May-2007 Robert Watson <rwatson@FreeBSD.org>

Generally migrate to ANSI function headers, and remove 'register' use.


# ccd8d954 07-May-2007 Pyun YongHyeon <yongari@FreeBSD.org>

Add missing socket buffer unlock before returning to userland.

Reviewed by: rwatson


# 7abab911 03-May-2007 Robert Watson <rwatson@FreeBSD.org>

sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.


# 8c799760 26-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Following movement of functions from uipc_socket2.c to uipc_socket.c and
uipc_sockbuf.c, clean up and update comments.


# 20d9e5e8 26-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Complete removal of uipc_socket2.c by moving the last few functions to
other C files:

- Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While
sbcreatecontrol() is really an mbuf allocation routine, it does its work
with awareness of the layout of socket buffer memory.

- Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub
versions of several of these functions live. Likewise, move socket state
transition calls (soisconnecting(), etc) to uipc_socket.c. Moveo
sodupsockaddr() and sotoxsocket().


# cd68a3f7 22-Mar-2007 Gleb Smirnoff <glebius@FreeBSD.org>

Move the dom_dispose and pru_detach calls in sofree() earlier. Only after
calling pru_detach we can be absolutely sure, that we don't have any
references to the socket in the stack.

This closes race between lockless sbdestroy() and data arriving on socket.

Reviewed by: rwatson


# 75685034 12-Mar-2007 John Baldwin <jhb@FreeBSD.org>

- Use m_gethdr(), m_get(), and m_clget() instead of the macros in
sosend_copyin().
- Use M_WAITOK instead of M_TRYWAIT in sosend_copyin().
- Don't check for NULL from M_WAITOK and return ENOBUFS.
M_WAITOK/M_TRYWAIT allocations don't fail with NULL.

Reviewed by: andre
Requested by: andre (2)


# fac61393 26-Feb-2007 Ruslan Ermilov <ru@FreeBSD.org>

Don't block on the socket zone limit during the socket()
call which can easily lock up a system otherwise; instead,
return ENOBUFS as documented in a manpage, thus reverting
us to the FreeBSD 4.x behavior.

Reviewed by: rwatson
MFC after: 2 weeks


# f58dd470 15-Feb-2007 Robert Watson <rwatson@FreeBSD.org>

Rename somaxconn_sysctl() to sysctl_somaxconn() so that I will be able to
claim that sofoo() functions all accept a socket as their first argument.


# 7dc8d021 02-Feb-2007 Bruce M Simpson <bms@FreeBSD.org>

Diff reduction with RELENG_6, style(9):
Remove unnecessary brace; && should be on end of line.
No functional changes.


# 6a37f331 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Generic socket buffer auto sizing support, header defines, flag inheritance.

MFC after: 1 month


# 7c32173b 22-Jan-2007 Andre Oppermann <andre@FreeBSD.org>

Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary
control data but no payload data is passed.

Change m_uiotombuf() to return at least one empty mbuf if the requested
length was zero. Add comment to sosend_dgram and sosend_generic().

Diagnoses by: jhb
Regression test by: rwatson
Pointy hat to. andre


# abdeb3b0 08-Jan-2007 Robert Watson <rwatson@FreeBSD.org>

Canonicalize copyrights in some files I hold copyrights on:

- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after: 3 days


# a86ec338 23-Dec-2006 Bruce M Simpson <bms@FreeBSD.org>

Drop all received data mbufs from a socket's queue if the MT_SONAME
mbuf is dropped, to preserve the invariant in the PR_ADDR case.

Add a regression test to detect this condition, but do not hook it
up to the build for now.

PR: kern/38495
Submitted by: James Juran
Reviewed by: sam, rwatson
Obtained from: NetBSD
MFC after: 2 weeks


# 84eab9ad 22-Nov-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fix a race in soclose() where connections could be queued to the
listening socket after the pass that cleans those queues. This
results in these connections being orphaned (and leaked). The fix
is to clean up the so queues after detaching the socket from the
protocol. Thanks to ups and jhb for discussions and a thorough code
review.


# 1ae4d97d 02-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

Use the improved m_uiotombuf() function instead of home grown sosend_copyin()
to do the userland to kernel copying in sosend_generic() and sosend_dgram().

sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported
by m_uiotombuf().

Benchmaring shows significant improvements (95% confidence):
66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO)
65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO)

(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)

Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 month


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 4a75dc25 22-Sep-2006 Bruce M Simpson <bms@FreeBSD.org>

Fix a case where socket I/O atomicity is violated due to not dropping
the entire record when a non-data mbuf is removed in the soreceive() path.
This only triggers a panic directly when compiled with INVARIANTS.

PR: 38495
Submitted by: James Juran
MFC after: 1 week


# 689f94bf 13-Sep-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix a lock leak in an error case.

Reported by: netchild
Reviewed by: rwatson


# 805def2e 10-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

New sockets created by incoming connections into listen sockets should
inherit all settings and options except listen specific options.

Add the missing send/receive timeouts and low watermarks.
Remove inheritance of the field so_timeo which is unused.

Noticed by: phk
Reviewed by: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# daa5817e 18-Aug-2006 George V. Neville-Neil <gnn@FreeBSD.org>

Fix a kernel panic based on receiving an ICMPv6 Packet too Big message.

PR: 99779
Submitted by: Jinmei Tatuya
Reviewed by: clement, rwatson
MFC after: 1 week


# 79ad81c0 11-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Before performing a sodealloc() when pru_attach() fails, assert that
the socket refcount remains 1, and then drop to 0 before freeing the
socket.

PR: 101763
Reported by: Gleb Kozyrev <gkozyrev at ukr dot net>


# 9126410f 02-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Move destroying kqueue state from above pru_detach to below it in
sofree(), as a number of protocols expect to be able to call
soisdisconnected() during detach. That may not be a good assumption,
but until I'm sure if it's a good assumption or not, allow it.


# c0e1415d 01-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Move updated of 'numopensockets' from bottom of sodealloc() to the top,
eliminating a second set of identical mutex operations at the bottom.
This allows brief exceeding of the max sockets limit, but only by
sockets in the last stages of being torn down.


# eaa6dfbc 01-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc. Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer. A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.


# b0668f71 24-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman


# 809c2b78 23-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Update various uipc_socket.c comments, and reformat others.


# a152f8a3 21-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# 5cd1a271 16-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Change comment on soabort() to more accurately describe how/when
soabort() is used. Remove trailing white space.


# 5908c617 11-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel)
return void, so don't implement no-op versions of these functions.
Instead, consistently check if those switch pointers are NULL before
invoking them.


# f949ae9b 11-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

When pru_attach() fails, call sodealloc() on the socket rather than
using sorele() and the full tear-down path. Since protocol state
allocation failed, this is not required (and is arguably undesirable).
This matches the behavior of sonewconn() under the same circumstances.


# 721150ad 18-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

When retrieving SO_ERROR via getsockopt(), hold the socket lock around
the retrieval and replacement with 0.

MFC after: 1 week


# b37ffd31 10-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month


# e02421f3 08-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

Rearrange code in soalloc() so that it's less indented by returning
early if uma_zalloc() from the socket zone fails. No functional
change.

MFC after: 1 week


# 0cec9959 23-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Assert that sockets passed into soabort() not be SQ_COMP or SQ_INCOMP,
since that removal should have been done a layer up.

MFC after: 3 months


# 28ea1801 23-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Add missing 'not' to SQ_COMP comment.

MFC after: 3 months


# 6ca35d4b 23-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Move handling of SQ_COMP exception case in sofree() to the top of the
function along with the remainder of the reference checking code. Move
comment from body to header with remainder of comments. Inclusion of a
socket in a completed connection queue counts as a true reference, and
should not be handled as an under-documented edge case.

MFC after: 3 months


# bc725eaf 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


# ac45e92f 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


# 7f689de2 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Assert so->so_pcb is NULL in sodealloc() -- the protocol state should not
be present at this point. We will eventually remove this assert because
the socket layer should never look at so_pcb, but for now it's a useful
debugging tool.

MFC after: 3 months


# 220c1357 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Add a somewhat sizable comment documenting the semantics of various kernel
socket calls relating to the creation and destruction of sockets. This
will eventually form the foundation of socket(9), but is currently in too
much flux to do so.

MFC after: 3 months


# 92c07a34 16-Mar-2006 Robert Watson <rwatson@FreeBSD.org>

Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.


# 93709ad0 14-Mar-2006 Robert Watson <rwatson@FreeBSD.org>

As with socket consumer references (so_count), make sofree() return
without GC'ing the socket if a strong protocol reference to the socket
is present (SS_PROTOREF).


# 13f322c2 12-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Improve consistency of return() style.

MFC after: 3 days


# b8ae1cd6 13-Jan-2006 Robert Watson <rwatson@FreeBSD.org>

Add sosend_dgram(), a greatly reduced and simplified version of sosend()
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin(). Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.


# 398293a8 29-Nov-2005 John Baldwin <jhb@FreeBSD.org>

Fix snderr() to not leak the socket buffer lock if an error occurs in
sosend(). Robert accidentally changed the snderr() macro to jump to the
out label which assumes the lock is already released rather than the
release label which drops the lock in his previous change to sosend().
This should fix the recent panics about returning from write(2) with the
socket lock held and the most recent LOR on current@.


# 66dd8a6f 28-Nov-2005 Robert Watson <rwatson@FreeBSD.org>

Move zero copy statistics structure before sosend_copyin().

MFC after: 1 month
Reported by: tinderbox, sam


# a725629c 28-Nov-2005 Robert Watson <rwatson@FreeBSD.org>

Break out functionality in sosend() responsible for building mbuf
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin(). This changes makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.

MFC after: 1 month
Reviewed by: andre, glebius


# 34333b16 02-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# d374e81e 30-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.


# 53f5742d 26-Oct-2005 Paul Saab <ps@FreeBSD.org>

Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work.


# 8434c29b 18-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

Add three new read-only socket options, which allow regression tests
and other applications to query the state of the stack regarding the
accept queue on a listen socket:

SO_LISTENQLIMIT Return the value of so_qlimit (socket backlog)
SO_LISTENQLEN Return the value of so_qlen (complete sockets)
SO_LISTENINCQLEN Return the value of so_incqlen (incomplete sockets)

Minor white space tweaks to existing socket options to make them
consistent.

Discussed with: andre
MFC after: 1 week


# bc6b8b5d 18-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

Fix spelling in a comment.

MFC after: 3 days


# aada5ccc 15-Sep-2005 Maxim Konovalov <maxim@FreeBSD.org>

Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected
sockets.

Pointed out by: rwatson


# c5cff170 15-Sep-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Return ENOTCONN when shutdown(2) on non-connected socket.

PR: kern/84761
Submitted by: James Juran
R-test: tools/regression/sockets/shutdown
MFC after: 1 month


# 016e6212 06-Sep-2005 Gleb Smirnoff <glebius@FreeBSD.org>

In soreceive(), when a first mbuf is removed from socket buffer use
sockbuf_pushsync(). Previous manipulation could lead to an inconsistent
mbuf.

Reviewed by: rwatson


# dcb5fef5 01-Aug-2005 Kelly Yancey <kbyanc@FreeBSD.org>

Make getsockopt(..., SOL_SOCKET, SO_ACCEPTCONN, ...) work per IEEE Std
1003.1 (POSIX).


# 0d52d7b0 28-Jul-2005 George V. Neville-Neil <gnn@FreeBSD.org>

Fix for PR 83885.

Make sure that there actually is a next packet before setting
nextrecord to that field.

PR: 83885
Submitted by: hirose@comm.yamaha.co.jp
Obtained from: Patch suggested in the PR
MFC after: 1 week


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# fc74a9f9 10-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 8bde9359 09-Jun-2005 Scott Long <scottl@FreeBSD.org>

Drat! Committed from the wrong branch. Restore HEAD to its previous goodness.


# 76b472db 09-Jun-2005 Scott Long <scottl@FreeBSD.org>

Back out 1.68.2.26. It was a mis-guided change that was already backed out
of HEAD and should not have been MFC'd. This will restore UDP socket
functionality, which will correct the recent NFS problems.

Submitted by: rwatson


# 92dd256b 05-Jun-2005 Andrew Gallatin <gallatin@FreeBSD.org>

Allow sends sent from non page-aligned userspace addresses to be
considered for zero-copy sends.

Reviewed by: alc
Submitted by: Romer Gil at Rice University


# a59f81d2 11-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option
from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it
now matches do_setopt_accept_filter(). Slightly reformulate the logic to
match the optimistic allocation of storage for the argument in advance,
and slightly expand the coverage of the socket lock.


# 56856fbf 11-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

Remove an additional commented out reference to a possible future sx
lock.


# 2b37548a 11-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

When setting up a socket in socreate(), there's no need to lock the
socket lock around knlist_init(), so don't.

Hard code the setting of the socket reference count to 1 rather than
using soref() to avoid asserting the socket lock, since we've not yet
exposed the socket to other threads.

This removes two mutex operations from each socket allocation.


# 5fab68b1 11-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

Remove suggestive sx_init() comment in soalloc(). We will have something
like this at some point, but for now it clutters the source.


# 0daccb9c 21-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn


# a00428ef 20-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In soreceive(), when considering delivery to a socket in SS_ISCONFIRMING,
only call the protocol's pru_rcvd() if the protocol has the flag
PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with
the rest, which do check the flag.

MFC after: 3 days


# a7ae36bc 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

Correct a typo in the comment describing soreceive_rcvoob().

MFC after: 3 days


# 1b5c4b15 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In soconnect(), when resetting so->so_error, the socket lock is not
required due to a straight integer write in which minor races are not
a problem.


# 78e43644 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where
the rest of the accept filter code currently lives.

MFC after: 3 days


# 627de7fa 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

Re-order checks in socheckuid() so that we check all deny cases before
returning accept.

MFC after: 3 days


# 0d89301c 17-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In solisten(), unconditionally set the SO_ACCEPTCONN option in
so->so_options when solisten() will succeed, rather than setting it
conditionally based on there not being queued sockets in the completed
socket queue. Otherwise, if the protocol exposes new sockets via the
completed queue before solisten() completes, the listen() system call
will succeed, but the socket and protocol state will be out of sync.
For TCP, this didn't happen in practice, as the TCP code will panic if
a new connection comes in after the tcpcb has been transitioned to a
listening state but the socket doesn't have SO_ACCEPTCONN set.

This is historical behavior resulting from bitrot since 4.3BSD, in which
that line of code was associated with the conditional NULL'ing of the
connection queue pointers (one-time initialization to be performed
during the transition to a listening socket), which are now initialized
separately.

Discussed with: fenner, gnn
MFC after: 3 days


# 90d52f2f 23-Jan-2005 Gleb Smirnoff <glebius@FreeBSD.org>

- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX.

Before this change setting somaxconn to smth above 32767 and calling
listen(fd, -1) lead to a socket, which doesn't accept connections at all.

Reviewed by: rwatson
Reported by: Igor Sysoev


# fdf84ec4 12-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

When re-connecting already connected datagram socket ensure to clean
up its pending error state, which may be set in some rare conditions resulting
in connect() syscall returning that bogus error and making application believe
that attempt to change association has failed, while it has not in fact.

There is sockets/reconnect regression test which excersises this bug.

MFC after: 2 weeks


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# ba653911 22-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Remove an XXXRW indicating atomic operations might be used as a
substitute for a global mutex protecting the socket count and
generation number.

The observation that soreceive_rcvoob() can't return an mbuf
chain is a property, not a bug, so remove the XXXRW.

In sorflush, s/existing/previous/ for code when describing prior
behavior.

For SO_LINGER socket option retrieval, remove an XXXRW about why
we hold the mutex: this is correct and not dubious.

MFC after: 2 weeks


# 81b5dbec 22-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

In soalloc(), simplify the mac_init_socket() handling to remove
unnecessary use of a global variable and simplify the return case.
While here, use ()'s around return values.

In sodealloc(), remove a comment about why we bump the gencnt and
decrement the socket count separately. It doesn't add
substantially to the reading, and clutters the function.

MFC after: 2 weeks


# c73e3e92 09-Dec-2004 Alan Cox <alc@FreeBSD.org>

Remove unneeded code from the zero-copy receive path.

Discussed with: gallatin@
Tested by: ken@


# 1c4dbeda 07-Dec-2004 Alan Cox <alc@FreeBSD.org>

Tidy up the zero-copy receive path: Remove an unneeded argument to
uiomoveco() and userspaceco().


# d297f702 29-Nov-2004 Paul Saab <ps@FreeBSD.org>

If soreceive() is called from a socket callback, there's no reason
to do a window update to the peer (thru an ACK) from soreceive()
itself. TCP will do that upon return from the socket callback.
Sending a window update from soreceive() results in a lock reversal.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


# 85d11adf 29-Nov-2004 Paul Saab <ps@FreeBSD.org>

Make soreceive(MSG_DONTWAIT) nonblocking. If MSG_DONTWAIT is passed into
soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error
handling for the case where m_copym() returns failure.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


# 1449a2f5 09-Nov-2004 Gleb Smirnoff <glebius@FreeBSD.org>

Since sb_timeo type was increased to int, use INT_MAX instead of SHRT_MAX.
This also gives us ability to close PR.

PR: kern/42352
Approved by: julian (mentor)
MFC after: 1 week


# aae2782b 02-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Acquire the accept mutex in soabort() before calling sotryfree(), as
that is now required.

RELENG_5_3 candidate.

Foot provided by: Dikshie <dikshie at ppk dot itb dot ac dot id>


# 3a82a545 23-Oct-2004 Andre Oppermann <andre@FreeBSD.org>

socreate() does an early abort if either the protocol cannot be found,
or pru_attach is NULL. With loadable protocols the SPACER dummy protocols
have valid function pointers for all methods to functions returning just
EOPNOTSUPP. Thus the early abort check would not detect immediately that
attach is not supported for this protocol. Instead it would correctly
get the EOPNOTSUPP error later on when it calls the protocol specific
attach function.

Add testing against the pru_attach_notsupp() function pointer to the
early abort check as well.


# 81158452 18-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# 35b260cd 11-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Rework sofree() logic to take into account a possible race with accept().
Sockets in the listen queues have reference counts of 0, so if the
protocol decides to disconnect the pcb and try to free the socket, this
triggered a race with accept() wherein accept() would bump the reference
count before sofree() had removed the socket from the listen queues,
resulting in a panic in sofree() when it discovered it was freeing a
referenced socket. This might happen if a RST came in prior to accept()
on a TCP connection.

The fix is two-fold: to expand the coverage of the accept mutex earlier
in sofree() to prevent accept() from grabbing the socket after the "is it
really safe to free" tests, and to expand the logic of the "is it really
safe to free" tests to check that the refcount is still 0 (i.e., we
didn't race).

RELENG_5 candidate.

Much discussion with and work by: green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# 76f69398 05-Sep-2004 Robert Watson <rwatson@FreeBSD.org>

Expand the scope of the socket buffer locks in sopoll() to include the
state test as well as set, or we risk a race between a socket wakeup
and registering for select() or poll() on the socket. This does
increase the cost of the poll operation, but can probably be optimized
some in the future.

This appears to correct poll() "wedges" experienced with X11 on SMP
systems with highly interactive applications, and might affect a plethora
of other select() driven applications.

RELENG_5 candidate.

Problem reported by: Maxim Maximov <mcsi at mcsi dot pp dot ru>
Debugged with help of: dwhite


# fe0f2d4e 23-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Conditional acquisition of socket buffer mutexes when testing socket
buffers with kqueue filters is no longer required: the kqueue framework
will guarantee that the mutex is held on entering the filter, either
due to a call from the socket code already holding the mutex, or by
explicitly acquiring it. This removes the last of the conditional
socket locking.


# 7b38f0d3 20-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Back out uipc_socket.c:1.208, as it incorrectly assumes that all
sockets are connection-oriented for the purposes of kqueue
registration. Since UDP sockets aren't connection-oriented, this
appeared to break a great many things, such as RPC-based
applications and services (i.e., NFS). Since jmg isn't around I'm
backing this out before too many more feet are shot, but intend to
investigate the right solution with him once he's available.

Apologies to: jmg
Discussed with: imp, scottl


# 5d6dd468 19-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

make sure that the socket is either accepting connections or is connected
when attaching a knote to it... otherwise return EINVAL...

Pointed out by: benno


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 217a4b6e 10-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Replace a reference to splnet() with a reference to locking in a comment.


# 99901d0a 25-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Do some initial locking on accept filter registration and attach. While
here, close some races that existed in the pre-locking world during low
memory conditions. This locking isn't perfect, but it's closer than
before.


# cdb71f75 18-Jul-2004 David Malone <dwmalone@FreeBSD.org>

The recent changes to control message passing broke some things
that get certain types of control messages (ping6 and rtsol are
examples). This gets the new code closer to working:

1) Collect control mbufs for processing in the controlp ==
NULL case, so that they can be freed by externalize.

2) Loop over the list of control mbufs, as the externalize
function may not know how to deal with chains.

3) In the case where there is no externalize function,
remember to add the control mbuf to the controlp list so
that it will be returned.

4) After adding stuff to the controlp list, walk to the
end of the list of stuff that was added, incase we added
a chain.

This code can be further improved, but this is enough to get most
things working again.

Reviewed by: rwatson


# dad7b41a 15-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

When entering soclose(), assert that SS_NOFDREF is not already set.


# dcee93dc 12-Jul-2004 David Malone <dwmalone@FreeBSD.org>

Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by: alfred


# d58d3648 12-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama


# a294c366 11-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Use sockbuf_pushsync() to synchronize stack and socket buffer state
in soreceive() after removing an MT_SONAME mbuf from the head of the
socket buffer.

When processing MT_CONTROL mbufs in soreceive(), first remove all of
the MT_CONTROL mbufs from the head of the socket buffer to a local
mbuf chain, then feed them into dom_externalize() as a set, which
both avoids thrashing the socket buffer lock when handling multiple
control mbufs, and also avoids races with other threads acting on
the socket buffer when the socket buffer mutex is released to enter
the externalize code. Existing races that might occur if the protocol
externalize method blocked during processing have also been closed.

Now that we synchronize socket buffer and stack state following
modifications to the socket buffer, turn the manual synchronization
that previously followed control mbuf processing with a set of
assertions. This can eventually be removed.

The soreceive() code is now substantially more MPSAFE.


# b7562e17 11-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Add sockbuf_pushsync(), an inline function that, following a change to
the head of the mbuf chains in a socket buffer, re-synchronizes the
cache pointers used to optimize socket buffer appends. This will be
used by soreceive() before dropping socket buffer mutexes to make sure
a consistent version of the socket buffer is visible to other threads.

While here, update copyright to account for substantial rewrite of much
socket code required for fine-grained locking.


# d861372b 11-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Add additional annotations to soreceive(), documenting the effects of
locking on 'nextrecord' and concerns regarding potentially inconsistent
or stale use of socket buffer or stack fields if they aren't carefully
synchronized whenever the socket buffer mutex is released. Document
that the high-level sblock() prevents races against other readers on
the socket.

Also document the 'type' logic as to how soreceive() guarantees that
it will only return one of normal data or inline out-of-band data.


# 0014b343 10-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

In the 'dontblock' section of soreceive(), assert that the mbuf on hand
('m') is in fact the first mbuf in the receive socket buffer.


# 5e44d93f 10-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Break out non-inline out-of-band data receive code from soreceive()
and put it in its own helper function soreceive_rcvoob().


# a04b0939 10-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Assign pointers values of NULL rather than 0 in soreceive().


# 7e17bc9f 10-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

When the MT_SONAME mbuf is popped off of a receive socket buffer
associated with a PR_ADDR protocol, make sure to update the m_nextpkt
pointer of the new head mbuf on the chain to point to the next record.
Otherwise, when we release the socket buffer mutex, the socket buffer
mbuf chain may be in an inconsistent state.


# 5c2b7a22 09-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Now socket buffer locks are being asserted at higher code blocks in
soreceive(), remove some leaf assertions that are redundant.


# 32775a01 09-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Assert socket buffer lock at strategic points between sections of code
in soreceive() to confirm we've moved from block to block properly
maintaining locking invariants.


# 6a72b225 05-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT.
A subset of locking changes to soreceive() in the queue for merging.

Bumped into by: Willem Jan Withagen <wjw@withagen.nl>


# a2905746 26-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Add a new global mutex, so_global_mtx, which protects the global variables
so_gencnt, numopensockets, and the per-socket field so_gencnt. Annotate
this this might be better done with atomic operations.

Annotate what accept_mtx protects.


# 11c40a39 26-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Replace comment on spl state when calling soabort() with a comment on
locking state. No socket locks should be held when calling soabort()
as it will call into protocol code that may acquire socket locks.


# c6b93bf2 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Lock socket buffers when processing setting socket options SO_SNDLOWAT
or SO_RCVLOWAT for read-modify-write.


# adb4cf0f 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Slide socket buffer lock earlier in sopoll() to cover the call into
selrecord(), setting up select and flagging the socker buffers as SB_SEL
and setting up select under the lock.


# fea24c0a 21-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Remove spl's from uipc_socket to ease in merging.


# a34b7046 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Merge next step in socket buffer locking:

- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().

- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# fa8368a8 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

When retrieving the SO_LINGER socket option for user space, hold the
socket lock over pulling so_options and so_linger out of the socket
structure in order to retrieve a consistent snapshot. This may be
overkill if user space doesn't require a consistent snapshot.


# 6f4b1b55 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Convert an if->panic in soclose() into a call to KASSERT().


# ed2f7766 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Annotate some ordering-related issues in solisten() which are not yet
resolved by socket locking: in particular, that we test the connection
state at the socket layer without locking, request that the protocol
begin listening, and then set the listen state on the socket
non-atomically, resulting in a non-atomic cross-layer test-and-set.


# 31f555a1 18-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state. Convert tsleep() into msleep() with socket buffer mutex
as argument. Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state. Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation. Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy. Assert the socket buffer mutex
strategically after code sections or at the beginning of loops. In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive(). This commit only merges basic socket locking in this
code; follow-up commits will close additional races. As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by: juli, tjr


# 7b574f2e 17-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Hold SOCK_LOCK(so) while frobbing so_options. Note that while the
local race is corrected, there's still a global race in sosend()
relating to so_options and the SO_DONTROUTE flag.


# c0122607 17-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Merge some additional leaf node socket buffer locking from
rwatson_netperf:

Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.

Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.

Simplify the logic in sodisconnect() since we no longer need spls.

NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.


# 9535efc0 17-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Merge additional socket buffer locking from rwatson_netperf:

- Lock down low hanging fruit use of sb_flags with socket buffer
lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.


# c0b99ffa 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 395a08c9 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# f6c0cce6 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer. Add accessor macros to use the
mutex (SOCKBUF_*()). Initialize the mutex in soalloc(), and destroy
it in sodealloc(). Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by: sam
Sponosored by: FreeBSD Foundation
Obtained from: BSD/OS


# 1a5ff928 08-Jun-2004 Stefan Farfeleder <stefanf@FreeBSD.org>

Avoid assignments to cast expressions.

Reviewed by: md5
Approved by: das (mentor)


# 2658b3bb 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.


# 36568179 31-May-2004 Robert Watson <rwatson@FreeBSD.org>

The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.


# 866046f5 31-May-2004 Don Lewis <truckman@FreeBSD.org>

Add MSG_NBIO flag option to soreceive() and sosend() that causes
them to behave the same as if the SS_NBIO socket flag had been set
for this call. The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).

Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation. The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 123f024b 09-Apr-2004 Robert Watson <rwatson@FreeBSD.org>

Compare pointers with NULL rather than using pointers are booleans in
if/for statements. Assign pointers to NULL rather than typecast 0.
Compare pointers with NULL rather than 0.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 8e44a7ec 30-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

In sofree(), avoid nested declaration and initialization in
declaration. Observe that initialization in declaration is
frequently incompatible with locking, not just a bad idea
due to style(9).

Submitted by: bde


# 181e65db 29-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Use a common return path for filt_soread() and filt_sowrite() to
simplify the impact of locking on these functions.

Submitted by: sam
Sponsored by: FreeBSD Foundation


# 71c90a29 29-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

In sofree(), moving caching of 'head' from 'so->so_head' to later in
the function once it has been determined to be non-NULL to simplify
locking on an earlier return.


# 746e5bf0 29-Feb-2004 Robert Watson <rwatson@FreeBSD.org>

Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by: sam


# 740d9ba6 29-Feb-2004 Scott Long <scottl@FreeBSD.org>

Convert the other use of flags to mflags in soalloc().


# 2bc87dcf 29-Feb-2004 Robert Watson <rwatson@FreeBSD.org>

Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by: sam


# f662a931 11-Feb-2004 Brian Feldman <green@FreeBSD.org>

Always socantsendmore() before deallocating a socket. This, in turn,
calls selwakeup() if necessary (which it is, if you don't want freed
memory hanging around on your td->td_selq).

Props to: alfred


# be8a62e8 31-Jan-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.


# 0541040c 18-Jan-2004 Ruslan Ermilov <ru@FreeBSD.org>

Since "m" is not part of the "mp" chain, need to free() it.

Reported by: Stanford Metacompilation research group


# 9e71dd0f 16-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Reduce gratuitous redundancy and length in function names:

mac_setsockopt_label_set() -> mac_setsockopt_label()
mac_getsockopt_label_get() -> mac_getsockopt_label()
mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 12cbb9dc 15-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested. This fixes process
queries of socket labels.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 512824f8 09-Nov-2003 Seigo Tanimura <tanimura@FreeBSD.org>

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# 395bb186 27-Oct-2003 Sam Leffler <sam@FreeBSD.org>

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# 184dcdc7 21-Oct-2003 Mike Silbersack <silby@FreeBSD.org>

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


# cc342686 04-Aug-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


# 4e19fe10 17-Jul-2003 Robert Drehmel <robert@FreeBSD.org>

To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it. knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR: kern/54331


# 330841c7 14-Jul-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok.

Reported by: arr


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 104a9b7e 29-Apr-2003 Alexander Kabaev <kan@FreeBSD.org>

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 695d74f3 14-Apr-2003 Olivier Houchard <cognet@FreeBSD.org>

Use while (*controlp != NULL) instead of do ... while (*control != NULL)
There are valid cases where *controlp will be NULL at this point.

Discussed with: dwmalone


# 8994a245 02-Mar-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Clean up whitespace, s/register //, refrain from strong urge to ANSIfy.


# c9524588 02-Mar-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

uiomove-related caddr_t -> void * (just the low-hanging fruit)


# d6bf2378 19-Feb-2003 Olivier Houchard <cognet@FreeBSD.org>

Remove duplicate includes.

Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 6f7cab93 17-Jan-2003 Thomas Moestl <tmm@FreeBSD.org>

Disallow listen() on sockets which are in the SS_ISCONNECTED or
SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates
in this case).
listen() on connected or connecting sockets would cause them to enter
a bad state; in the TCP case, this could cause sockets to go
catatonic or panics, depending on how the socket was connected.

Reviewed by: -net
MFC after: 2 weeks


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# a09de2f7 05-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

In sodealloc(), if there is an accept filter present on the socket
then call do_setopt_accept_filter(so, NULL) which will free the filter
instead of duplicating the code in do_setopt_accept_filter().

Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>


# 6ce9c72c 23-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.


# 8819f45b 27-Nov-2002 Maxim Konovalov <maxim@FreeBSD.org>

Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero.

PR: kern/32827
Submitted by: Hartmut Brandt <brandt@fokus.gmd.de>
Approved by: re (jhb)
MFC after: 2 weeks


# 29f19445 08-Nov-2002 Alfred Perlstein <alfred@FreeBSD.org>

Fix instances of macros with improperly parenthasized arguments.

Verified by: md5


# 247a32f2 05-Nov-2002 Kelly Yancey <kbyanc@FreeBSD.org>

Fix filt_soread() to properly flag a kevent when a 0-byte datagram is
received.

Verified by: dougb, Manfred Antar <null@pozo.com>
Sponsored by: NTT Multimedia Communications Labs


# 5ee0a409 01-Nov-2002 Alan Cox <alc@FreeBSD.org>

Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner


# e0f640e8 01-Nov-2002 Kelly Yancey <kbyanc@FreeBSD.org>

Track the number of non-data chararacters stored in socket buffers so that
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.

PR: 30634
Reviewed by: -net, -arch
Sponsored by: NTT Multimedia Communications Labs
MFC after: 2 weeks


# 6151efaa 28-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Trim extraneous #else and #endif MAC comments per style(9).


# 83985c26 05-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Modify label allocation semantics for sockets: pass in soalloc's malloc
flags so that we can call malloc with M_NOWAIT if necessary, avoiding
potential sleeps while holding mutexes in the TCP syncache code.
Similar to the existing support for mbuf label allocation: if we can't
allocate all the necessary label store in each policy, we back out
the label allocation and fail the socket creation. Sync from MAC tree.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# ea6027a8 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 5c5384fe 12-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Use the credential authorizing the socket creation operation to perform
the jail check and the MAC socket labeling in socreate(). This handles
socket creation using a cached credential better (such as in the NFS
client code when rebuilding a socket following a disconnect: the new
socket should be created using the nfsmount cached cred, not the cred
of the thread causing the socket to be rebuilt).

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# f9d0d524 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde


# b8279195 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Implement two IOCTLs at the socket level to retrieve the primary
and peer labels from a socket. Note that this user process interface
will be changing to improve multi-policy support.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 335654d7 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on sockets.
In particular, invoke entry points during socket allocation and
destruction, as well as creation by a process or during an
accept-scenario (sonewconn). For UNIX domain sockets, also assign
a peer label. As the socket code isn't locked down yet, locking
interactions are not yet clear. Various protocol stack socket
operations (such as peer label assignment for IPv4) will follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 5f0de712 24-Jul-2002 Mike Barcroft <mike@FreeBSD.org>

Catch up to rev 1.87 of sys/sys/socketvar.h (sb_cc changed from u_long
to u_int).

Noticed by: sparc64 tinderbox


# 80208239 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

More caddr_t removal.
Change struct knote's kn_hook from caddr_t to void *.


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# c33c8251 20-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

Implement SO_NOSIGPIPE option for sockets. This allows one to request that
an EPIPE error return not generate SIGPIPE on sockets.

Submitted by: lioux
Inspired by: Darwin


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# ec418160 21-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- td will never be NULL, so the call to soalloc() in socreate() will always
be passed a 1; we can, however, use M_NOWAIT to indicate this.
- Check so against NULL since it's a pointer to a structure.


# 1515cd22 21-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- OR the flag variable with M_ZERO so that the uma_zalloc() handles the
zero'ing out of the allocated memory. Also removed the logical bzero
that followed.


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# e649887b 06-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.


# f1320723 01-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# e1f1827f 25-Apr-2002 Mike Silbersack <silby@FreeBSD.org>

Make sure that sockets undergoing accept filtering are aborted in a
LRU fashion when the listen queue fills up. Previously, there was
no mechanism to kick out old sockets, leading to an easy DoS of
daemons using accept filtering.

Reviewed by: alfred
MFC after: 3 days


# 20504246 07-Apr-2002 Jeffrey Hsu <hsu@FreeBSD.org>

There's only one socket zone so we don't need to remember it
in every socket structure.


# 59295dba 20-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

UMA permited us to utilize the 'waitok' flag to soalloc.


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 167b8d03 28-Feb-2002 Ian Dowse <iedowse@FreeBSD.org>

In sosend(), enforce the socket buffer limits regardless of whether
the data was supplied as a uio or an mbuf. Previously the limit was
ignored for mbuf data, and NFS could run the kernel out of mbufs
when an ipfw rule blocked retransmissions.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# ecde8f7c 04-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

Get rid of the twisted MFREE() macro entirely.

Reviewed by: dg, bmilekic
MFC after: 3 days


# 468485b8 14-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Fix select on fifos.

Backout revision 1.56 and 1.57 of fifo_vnops.c.

Introduce a new poll op "POLLINIGNEOF" that can be used to ignore
EOF on a fifo, POLLIN/POLLRDNORM is converted to POLLINIGNEOF within
the FIFO implementation to effect the correct behavior.

This should allow one to view a fifo pretty much as a data source
rather than worry about connections coming and going.

Reviewed by: bde


# 9c4d63da 31-Dec-2001 Robert Watson <rwatson@FreeBSD.org>

o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
argument.

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch


# b1e4abd2 16-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


# 7377f0d1 12-Nov-2001 Giorgos Keramidas <keramida@FreeBSD.org>

Remove EOL whitespace.

Reviewed by: alfred


# 074df018 12-Nov-2001 Giorgos Keramidas <keramida@FreeBSD.org>

Make KASSERT's print the values that triggered a panic.

Reviewed by: alfred


# bd78cece 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 8a7d8cc6 09-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


# 4787fd37 05-Oct-2001 Paul Saab <ps@FreeBSD.org>

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


# 2bc21ed9 04-Oct-2001 David Malone <dwmalone@FreeBSD.org>

Hopefully improve control message passing over Unix domain sockets.

1) Allow the sending of more than one control message at a time
over a unix domain socket. This should cover the PR 29499.

2) This requires that unp_{ex,in}ternalize and unp_scan understand
mbufs with more than one control message at a time.

3) Internalize and externalize used to work on the mbuf in-place.
This made life quite complicated and the code for sizeof(int) <
sizeof(file *) could end up doing the wrong thing. The patch always
create a new mbuf/cluster now. This resulted in the change of the
prototype for the domain externalise function.

4) You can now send SCM_TIMESTAMP messages.

5) Always use CMSG_DATA(cm) to determine the start where the data
in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1)
in some places, which gives the wrong alignment on the alpha.
(NetBSD made this fix some time ago).

This results in an ABI change for discriptor passing and creds
passing on the alpha. (Probably on the IA64 and Spare ports too).

6) Fix userland programs to use CMSG_* macros too.

7) Be more careful about freeing mbufs containing (file *)s.
This is made possible by the prototype change of externalise.

PR: 29499
MFC after: 6 weeks


# ed01445d 21-Sep-2001 John Baldwin <jhb@FreeBSD.org>

Use the passed in thread to selrecord() instead of curthread.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 3abedb4e 27-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

Actually show the values that tripped the assertion "receive 1"


# 4d286823 16-Mar-2001 Jonathan Lemon <jlemon@FreeBSD.org>

When doing a recv(.. MSG_WAITALL) for a message which is larger than
the socket buffer size, the receive is done in sections. After completing
a read, call pru_rcvd on the underlying protocol before blocking again.

This allows the the protocol to take appropriate action, such as
sending a TCP window update to the peer, if the window happened to
close because the socket buffer was filled. If the protocol is not
notified, a TCP transfer may stall until the remote end sends a window
probe.


# c0647e0d 09-Mar-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Push the test for a disconnected socket when accept()ing down to the
protocol layer. Not all protocols behave identically. This fixes the
brokenness observed with unix-domain sockets (and postfix)


# 8ac6dca7 27-Feb-2001 Ruslan Ermilov <ru@FreeBSD.org>

In soshutdown(), use SHUT_{RD,WR,RDWR} instead of FREAD and FWRITE.
Also, return EINVAL if `how' is invalid, as required by POSIX spec.


# da403b9d 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a NOTE_LOWAT flag for use with the read/write filters, which
allow the watermark to be passed in via the data field during the EV_ADD
operation.

Hook this up to the socket read/write filters; if specified, it overrides
the so_{rcv|snd}.sb_lowat values in the filter.

Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>


# b07540c8 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

When returning EV_EOF for the socket read/write filters, also return
the current socket error in fflags. This may be useful for determining
why a connect() request fails.

Inspired by: "Jonathan Graehl" <jonathan@graehl.org>


# 91421ba2 20-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 608a3ce6 15-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


# 2fd7d53d 13-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Return ECONNABORTED from accept if connection is closed while on the
listen queue, as well as the current behavior of a zero-length sockaddr.

Obtained from: KAME
Reviewed by: -net


# a3ea6d41 21-Jan-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

First step towards an MP-safe zone allocator:
- have zalloc() and zfree() always lock the vm_zone.
- remove zalloci() and zfreei(), which are now redundant.

Reviewed by: bmilekic, jasone


# 2a0c503e 21-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 830fedd2 19-Nov-2000 Alfred Perlstein <alfred@FreeBSD.org>

Accept filters broke kernels compiled without options INET.
Make accept filters conditional on INET support to fix.

Pointed out by: bde
Tested and assisted by: Stephen J. Kiernan <sab@vegamuse.org>


# d5aa1234 27-Sep-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Check so_error in filt_so{read|write} in order to detect UDP errors.

PR: 21601


# f535380c 05-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 6aef685f 29-Aug-2000 Brian Feldman <green@FreeBSD.org>

Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to. The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.


# a1144591 07-Aug-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Make the kqueue socket read filter honor the SO_RCVLOWAT value.

Spotted by: "Steve M." <stevem@redlinenetworks.com>


# f4088964 19-Jul-2000 Alfred Perlstein <alfred@FreeBSD.org>

only allow accept filter modifications on listening sockets

Submitted by: ps


# c6362551 22-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green


# a79b7128 19-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

return of the accept filter part II

accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg


# a72fda71 18-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

backout accept optimizations.

Requested by: jmg, dcs, jdp, nate


# 8f4e4aa5 15-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

add socketoptions DELAYACCEPT and HTTPACCEPT which will not allow an accept()
until the incoming connection has either data waiting or what looks like a
HTTP request header already in the socketbuffer. This ought to reduce
the context switch time and overhead for processing requests.

The initial idea and code for HTTPACCEPT came from Yahoo engineers and has
been cleaned up and a more lightweight DELAYACCEPT for non-http servers
has been added

Reviewed by: silence on hackers.


# 3b43fd62 13-Jun-2000 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix panic by moving the prp == 0 check up the order of sanity checks.

Submitted by: Bart Thate <freebsd@1st.dudi.org> on -current
Approved by: rwatson


# 7cadc266 03-Jun-2000 Robert Watson <rwatson@FreeBSD.org>

o Modify jail to limit creation of sockets to UNIX domain sockets,
TCP/IP (v4) sockets, and routing sockets. Previously, interaction
with IPv6 was not well-defined, and might be inappropriate for some
environments. Similarly, sysctl MIB entries providing interface
information also give out only addresses from those protocol domains.

For the time being, this functionality is enabled by default, and
toggleable using the sysctl variable jail.socket_unixiproute_only.
In the future, protocol domains will be able to determine whether or
not they are ``jail aware''.

o Further limitations on process use of getpriority() and setpriority()
by jailed processes. Addresses problem described in kern/17878.

Reviewed by: phk, jmg


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# 95b2b777 18-Mar-2000 Bill Fenner <fenner@FreeBSD.org>

Make sure to free the socket in soabort() if the protocol couldn't
free it (this could happen if the protocol already freed its part
and we just kept the socket around to make sure accept(2) didn't block)


# bfbbc4aa 13-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# c2696359 26-Dec-1999 Brian Feldman <green@FreeBSD.org>

Correct an uninitialized variable use, which, unlike most times, is
actually a bug this time.

Submitted by: bde
Reviewed by: bde


# f48b807f 11-Dec-1999 Brian Feldman <green@FreeBSD.org>

This is Bosko Milekic's mbuf allocation waiting code. Basically, this
means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.

Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: green, wollman


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 2e3c8fcb 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# ecf72308 09-Oct-1999 Brian Feldman <green@FreeBSD.org>

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# 2f9a2132 18-Sep-1999 Brian Feldman <green@FreeBSD.org>

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# f29be021 17-Jun-1999 Brian Feldman <green@FreeBSD.org>

Reviewed by: the cast of thousands

This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.


# 9c9906e9 03-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt. This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion. This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.

Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it. The machine
in question was loosing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it. (Don't blame Jayanth
for this patch though)

An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened. However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code. There are other things that call pru_send()
directly though.

Problem originally noted by: John Plevyak <jplevyak@inktomi.com>


# 925fa5c3 21-May-1999 Andrey A. Chernov <ache@FreeBSD.org>

Realy fix overflow on SO_*TIMEO

Submitted by: bde


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 02a3d526 24-Apr-1999 Andrey A. Chernov <ache@FreeBSD.org>

Lite2 bugfixes merge:
so_linger is in seconds, not in 1/HZ
range checking in SO_*TIMEO was wrong

PR: 11252


# ce02431f 16-Feb-1999 Doug Rabson <dfr@FreeBSD.org>

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


# 8f70ac3e 02-Feb-1999 Bill Fenner <fenner@FreeBSD.org>

Fix the port of the NetBSD 19990120-accept fix. I misread a piece of
code when examining their fix, which caused my code (in rev 1.52) to:
- panic("soaccept: !NOFDREF")
- fatal trap 12, with tracebacks going thru soclose and soaccept


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 527b7a14 25-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Port NetBSD's 19990120-accept bug fix. This works around the race condition
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.

Reviewed by: wollman
Obtained from: NetBSD
(ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)


# 7b177710 20-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Also consider the space left in the socket buffer when deciding whether
to set PRUS_MORETOCOME.


# b0acefa8 20-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# f1d19042 07-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# 831d27a9 11-Nov-1998 Don Lewis <truckman@FreeBSD.org>

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 9898afa1 31-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Bow to tradition and correctly implement the bogus-but-hallowed semantics
of getsockopt never telling how much it might have copied if only the
buffer were big enough.


# d224dbc1 31-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Correctly set the return length regardless of the relative size of the
user's buffer. Simplify the logic a bit. (Can we have a version of
min() for size_t?)


# cfe8b629 22-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 0c495036 18-Jul-1998 Bill Fenner <fenner@FreeBSD.org>

Undo rev 1.41 until we get more details about why it makes some systems
fail.


# dece5b6a 06-Jul-1998 Bill Fenner <fenner@FreeBSD.org>

Introduce (fairly hacky) workaround for odd TCP behavior with application
writes of size (100,208]+N*MCLBYTES.

The bug:
sosend() hands each mbuf off to the protocol output routine as soon as it
has copied it, in the hopes of increasing parallelism (see
http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for
TCP as long as the first mbuf handed off is at least the MSS. However,
when doing small writes (between MHLEN and MINCLSIZE), the transaction is
split into 2 small MBUF's and each is individually handed off to TCP.
TCP assumes that the first small mbuf is the whole transaction, so sends
a small packet. When the second small mbuf arrives, Nagle prevents TCP
from sending it so it must wait for a (potentially delayed) ACK. This
sends throughput down the toilet.

The workaround:
Set the "atomic" flag when we're doing small writes. The "atomic" flag
has two meanings:
1. Copy all of the data into a chain of mbufs before handing off to the
protocol.
2. Leave room for a datagram header in said mbuf chain.
TCP wants the first but doesn't want the second. However, the second
simply results in some memory wastage (but is why the workaround is a
hack and not a fix).

The real fix:
The real fix for this problem is to introduce something like a "requested
transfer size" variable in the socket->protocol interface. sosend()
would then accumulate an mbuf chain until it exceeded the "requested
transfer size". TCP could set it to the TCP MSS (note that the
current interface causes strange TCP behaviors when the MSS > MCLBYTES;
nobody notices because MCLBYTES > ethernet's MTU).


# 98271db4 15-May-1998 Garrett Wollman <wollman@FreeBSD.org>

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 4049a042 01-Mar-1998 Guido van Rooij <guido@FreeBSD.org>

Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD


# 92f57d00 19-Feb-1998 Bill Fenner <fenner@FreeBSD.org>

Revert sosend() to its behavior from 4.3-Tahoe and before: if
so_error is set, clear it before returning it. The behavior
introduced in 4.3-Reno (to not clear so_error) causes potentially
transient errors (e.g. ECONNREFUSED if the other end hasn't opened
its socket yet) to be permanent on connected datagram sockets that
are only used for writing.

(soreceive() clears so_error before returning it, as does
getsockopt(...,SO_ERROR,...).)

Submitted by: Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 64bd2f7b 08-Nov-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

MF22: MSG_EOR bug fix.
Submitted by: wollman


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# eabecea3 04-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

While booting diskless we have no proc pointer.


# e25aa68e 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Extend select backend for sockets to work with a poll interface (more
detail is passed back and forwards). This mostly came from NetBSD, except
that our interfaces have changed a lot and this funciton is in a different
part of the kernel.

Obtained from: NetBSD


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# b1037dcd 21-Aug-1997 Bruce Evans <bde@FreeBSD.org>

#include <machine/limits.h> explicitly in the few places that it is required.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 006ad618 27-Jun-1997 Peter Wemm <peter@FreeBSD.org>

Don't accept insane values for SO_(SND|RCV)BUF, and the low water marks.
Specifically, don't allow a value < 1 for any of them (it doesn't make
sense), and don't let the low water mark be greater than the corresponding
high water mark.

Pre-Approved by: wollman
Obtained from: NetBSD


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 639acc13 24-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Create a new branch of the kernel MIB, kern.ipc, to store
all of the configurables and instrumentation related to
inter-process communication mechanisms. Some variables,
like mbuf statistics, are instrumented here for the first
time.

For mbuf statistics: also keep track of m_copym() and
m_pullup() failures, and provide for the user's inspection
the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# add2e5d0 29-Nov-1996 David Greenman <dg@FreeBSD.org>

Check for error return from uiomove to prevent looping endlessly in
soreceive(). Closes PR#2114.

Submitted by: wpaul


# ebb0cbea 06-Oct-1996 Paul Traina <pst@FreeBSD.org>

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


# 2c37256e 11-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


# 82dab6ce 09-May-1996 Garrett Wollman <wollman@FreeBSD.org>

Make it possible to return more than one piece of control information
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).

Submitted by: Louis Mamakos <loiue@TransSys.com>


# 46f578e7 15-Apr-1996 David Greenman <dg@FreeBSD.org>

Fix for PR #1146: the "next" pointer must be cached before calling soabort
since the struct containing it may be freed.


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# be24e9e8 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Changed socket code to use 4.4BSD queue macros. This includes removing
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).


# dc915e7c 13-Feb-1996 Garrett Wollman <wollman@FreeBSD.org>

Kill XNS.
While we're at it, fix socreate() to take a process argument. (This
was supposed to get committed days ago...)


# b1358054 07-Feb-1996 Garrett Wollman <wollman@FreeBSD.org>

Define a new socket option, SO_PRIVSTATE. Getting it returns the state
of the SS_PRIV flag in so_state; setting it always clears same.


# 47daf5d5 14-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Nuked ambiguous sleep message strings:
old: new:
netcls[] = "netcls" "soclos"
netcon[] = "netcon" "accept", "connec"
netio[] = "netio" "sblock", "sbwait"


# ff5c09da 03-Nov-1995 Garrett Wollman <wollman@FreeBSD.org>

Make somaxconn (maximum backlog in a listen(2) request) and sb_max
(maximum size of a socket buffer) tunable.

Permit callers of listen(2) to specify a negative backlog, which
is translated into somaxconn. Previously, a negative backlog was
silently translated into 0.


# 5e319b84 25-Aug-1995 Bruce Evans <bde@FreeBSD.org>

Remove extra arg from one of the calls to (*pr_usrreq)().


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 5f540404 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

getsockopt(s, SOL_SOCKET, SO_SNDTIMEO, ...) would construct the returned
timeval incorrectly, truncating the usec part.

Obtained from: Stevens vol. 2 p. 548


# 6b8fda4d 06-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge in the socket-level support for Transaction TCP.


# a635d6c7 05-Feb-1995 David Greenman <dg@FreeBSD.org>

Use M_NOWAIT instead of M_KERNEL for socket allocations; it is apparantly
possible for certain socket operations to occur during interrupt context.

Submitted by: John Dyson


# 9f518539 02-Feb-1995 David Greenman <dg@FreeBSD.org>

Calling semantics for kmem_malloc() have been changed...and the third
argument is now more than just a single flag. (kern_malloc.c)
Used new M_KERNEL value for socket allocations that previous were
"M_NOWAIT". Note that this will change when we clean up the M_ namespace
mess.

Submitted by: John Dyson


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 3962127e 29-May-1994 David Greenman <dg@FreeBSD.org>

Changed mbuf allocation policy to get a cluster if size > MINCLSIZE. Makes
a BIG difference in socket performance.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources