History log of /freebsd-current/sys/netgraph/ng_ksocket.c
Revision Date Author Comments
# 1a3d1be4 22-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

ng_ksocket: use new macros to lock socket buffers


# f79a8585 30-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: garbage collect SS_ISCONFIRMING

Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e


# 0fac350c 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on getpeername/getsockname

Just like it was done for accept(2) in cfb1e92912b4, use the same approach
for two simpler syscalls that return socket addresses. Although
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls, trimming code size.

Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
the required length if the provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as if they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635
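
The in/out 'addrlen' behaviour described in this entry can be illustrated with a small userland sketch. It is not the kernel code from the commit: copy_addr_out() is a hypothetical helper invented for illustration, and it only assumes the semantics stated above (truncate the copy, but report the full length).

```c
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the FreeBSD tree): copy a socket address
 * into a caller-supplied buffer, treating *addrlen as an in/out value.
 * On entry *addrlen is the buffer capacity; on return it holds the full
 * length of the address, even when the copy had to be truncated.
 * Returns the number of bytes actually copied.
 */
socklen_t
copy_addr_out(const struct sockaddr *src, socklen_t srclen,
    void *dst, socklen_t *addrlen)
{
	socklen_t ncopy = (srclen < *addrlen) ? srclen : *addrlen;

	memcpy(dst, src, ncopy);
	*addrlen = srclen;	/* report the required length */
	return (ncopy);
}
```

A caller can compare the returned length with its buffer size to detect truncation and retry with a larger buffer.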


# 43f7e216 17-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

ng_ksocket: fix accept(2)

- Provide listen upcall and set it on NGM_KSOCKET_LISTEN
- Mask EWOULDBLOCK on NGM_KSOCKET_ACCEPT

Reviewed by: afedorov
Differential Revision: https://reviews.freebsd.org/D42637
PR: 272319
PR: 275106
Fixes: 779f106aa169256b7010a1d8f963ff656b881e92


# efad7cbf 17-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

ng_ksocket: fix upcall clearing on node shutdown

Note: imho, the proper solution would be to guarantee that upcalls
won't ever be called after soclose(), but this isn't the case, yet.
This change at least makes the node work the way it always worked.

Reviewed by: afedorov
Differential Revision: https://reviews.freebsd.org/D42636
PR: 272319
PR: 275106
Fixes: 779f106aa169256b7010a1d8f963ff656b881e92


# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 8624f434 30-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

divert: declare PF_DIVERT domain and stop abusing PF_INET

The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.

Differential revision: https://reviews.freebsd.org/D36379


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 497c2cf4 07-Apr-2022 John Baldwin <jhb@FreeBSD.org>

ng_ksocket: Remove unused variable.


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 7737de95 14-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Check return value from soaccept().

Coverity: 1376209


# 779f106a 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parallelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock held, if we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basically did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
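
The soref()/sorele() item in this entry mentions moving the reference count to atomic operations so that a reference can be taken without holding the socket lock. A generic userland sketch of that idea, using C11 atomics instead of the kernel's atomic(9), might look like the following. The ex_ names are hypothetical and the kernel functions do more than this (the entry notes that sorele() still involves locking).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference count inside struct socket. */
struct ex_sock {
	atomic_uint refs;
};

/* Take a reference; no lock is needed because the increment is atomic. */
void
ex_soref(struct ex_sock *so)
{
	atomic_fetch_add_explicit(&so->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool
ex_sorele(struct ex_sock *so)
{
	return (atomic_fetch_sub_explicit(&so->refs, 1,
	    memory_order_acq_rel) == 1);
}
```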


# 053359b7 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/netgraph: spelling fixes in comments.

No functional change.


# 45c203fc 14-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove AppleTalk support.

AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.


# 2c284d93 13-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove IPX support.

IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.


# 9165bf62 21-Dec-2013 Gleb Smirnoff <glebius@FreeBSD.org>

In r248885 I have reduced size of fake uio resid that ng_ksocket(4) passes
to the soreceive(). This exposed a bug. When reading from a raw socket,
when our fake limit is depleted, we receive a truncated mbuf chain, with
m->m_pkthdr.len > m_length(m). The first problem is that MSG_TRUNC was not
handled. The second one is that we didn't reinit uio_resid in our endless
loop (neither flags), and if socket buffer contained several records, then
we quickly deplete our fake limit. The third bug, actually introduced in
r248885, is that MJUMPAGESIZE isn't enough to handle maximum packet that
ng_ksocket(4) can theoretically receive.

Changes:
- Reinit uio_resid and flags before every call to soreceive().
- Set maximum acceptable size of packet to IP_MAXPACKET. As for now the
module doesn't support INET6.
- Properly handle MSG_TRUNC return from soreceive().

PR: 184601
Submitted & tested by: Viktor Velichkin <avisom yandex.ru>
Sponsored by: Nginx, Inc.
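
The first and third items under "Changes" can be modelled in userland to show why the per-call reset matters. This is only a sketch: ex_receive() is a hypothetical stand-in for soreceive(), the EX_ constants are illustrative values, and dropping truncated records is a choice made for this example rather than a statement about what the commit does with them.

```c
#include <stddef.h>

#define EX_MAXPACKET	65535	/* per-call budget, in the spirit of IP_MAXPACKET */
#define EX_DONTWAIT	0x1	/* illustrative flag values */
#define EX_TRUNC	0x2

struct ex_queue {
	const size_t	*reclen;	/* lengths of queued records */
	size_t		 nrec;		/* number of records */
	size_t		 next;		/* index of next record to hand out */
};

/*
 * Stand-in for soreceive(): hands out one record, consumes the caller's
 * budget in *resid and sets EX_TRUNC in *flags if the record did not fit.
 * Returns 0 when the queue is empty, 1 otherwise.
 */
int
ex_receive(struct ex_queue *q, size_t *resid, int *flags)
{
	size_t len;

	if (q->next >= q->nrec)
		return (0);
	len = q->reclen[q->next++];
	if (len > *resid) {
		*flags |= EX_TRUNC;
		len = *resid;
	}
	*resid -= len;
	return (1);
}

/* Drain the queue; returns the number of records received untruncated. */
size_t
ex_drain(struct ex_queue *q)
{
	size_t resid, good = 0;
	int flags;

	for (;;) {
		/* Reinitialize both values before every call. */
		resid = EX_MAXPACKET;
		flags = EX_DONTWAIT;
		if (!ex_receive(q, &resid, &flags))
			break;
		if (flags & EX_TRUNC)
			continue;	/* this sketch drops truncated records */
		good++;
	}
	return (good);
}
```

If resid were set once before the loop, the second record in a multi-record buffer would already see a depleted budget, which is the second problem the message describes.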


# 9a4d9e19 29-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Revamp mbuf handling in ng_ksocket_incoming2():

- Clear code that worked around a bug in FreeBSD 3,
and even predated import of netgraph(4).
- Clear workaround for m_nextpkt pointing into
next record in buffer (fixed in r248884).
Assert that m_nextpkt is clear.
- Do not rely on SOCK_STREAM sockets containing
M_PKTHDR mbufs. Create a header ourselves and
attach chain to it. This is correct fix for
kern/154676.

PR: kern/154676
Sponsored by: Nginx, Inc


# 6b1781e3 29-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Whitespace.


# d09c774b 29-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Non-functional cleanup of ng_ksocket_incoming2().


# c9b652e3 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Mechanically remove the last stray remains of spl* calls from net*/*.
They have been no-ops for a long time now.


# 38f1b2d1 24-May-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r220768 for ng_ksocket. This node is special and
when it is cloning, its constructor method may be called
in a context that isn't allowed to sleep.

Noticed by: Vadim Goncharov


# dc15eac0 01-Jan-2012 Ed Schouten <ed@FreeBSD.org>

Use strchr() and strrchr().

It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use it.

For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.


# d745c852 06-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.

This means that their use is restricted to a single C file.


# 674d86bf 18-Apr-2011 Gleb Smirnoff <glebius@FreeBSD.org>

Node constructor methods are supposed to be called in syscall
context always. Convert nodes to consistently use M_WAITOK flag
for memory allocation.

Reviewed by: julian


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# f9e4dd71 06-May-2010 Fabien Thomas <fabient@FreeBSD.org>

Fix an invalid parameter detected by INVARIANT and confirmed by r193272.


# a35d1812 14-Apr-2010 Alexander Motin <mav@FreeBSD.org>

MFC r206017:
Make ng_ksocket fulfill lower protocol stack layers alignment requirements
on platforms with strict alignment constraints.
This fixes kernel panics on arm and probably other architectures.

PR: sparc64/80410, sparc/118932, sparc/113556


# 5c100aea 31-Mar-2010 Alexander Motin <mav@FreeBSD.org>

Make ng_ksocket fulfill lower protocol stack layers alignment requirements
on platforms with strict alignment constraints.
This fixes kernel panics on arm and probably other architectures.

PR: sparc64/80410


# fe1d3f15 28-Jun-2009 Stanislav Sedov <stas@FreeBSD.org>

- Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.

Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks


# 74fb0ba7 01-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held, however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowledge of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem
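
The upcall protocol described in this entry (invoke with the buffer locked, return SU_OK or SU_ISCONNECTED, defer soisconnected() until the lock is dropped) can be sketched as a userland model. All ex_ names and EX_ constants below are hypothetical, and a boolean stands in for the socket buffer mutex. This is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_SU_OK		0	/* illustrative values */
#define EX_SU_ISCONNECTED	1

struct ex_sock;
typedef int ex_upcall_t(struct ex_sock *, void *);

struct ex_sock {
	bool		 buf_locked;	/* models the socket buffer lock */
	bool		 connected;
	ex_upcall_t	*upcall;
	void		*upcall_arg;
};

/* Must not run with the buffer lock held, as the entry explains. */
void
ex_soisconnected(struct ex_sock *so)
{
	assert(!so->buf_locked);
	so->connected = true;
}

/* Invoke the upcall under the lock and act on its return value after. */
void
ex_wakeup(struct ex_sock *so)
{
	int ret = EX_SU_OK;

	so->buf_locked = true;
	if (so->upcall != NULL) {
		ret = so->upcall(so, so->upcall_arg);
		if (ret == EX_SU_ISCONNECTED) {
			/* The upcall is cleared automatically. */
			so->upcall = NULL;
			so->upcall_arg = NULL;
		}
	}
	so->buf_locked = false;
	if (ret == EX_SU_ISCONNECTED)
		ex_soisconnected(so);
}

/* Toy accept filter: reports "connected" once no more data is needed. */
int
ex_accept_filter(struct ex_sock *so, void *arg)
{
	const int *bytes_needed = arg;

	(void)so;
	return (*bytes_needed <= 0 ? EX_SU_ISCONNECTED : EX_SU_OK);
}
```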


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 6e7ed930 07-Mar-2008 Alexander Motin <mav@FreeBSD.org>

Send only one incoming notification at a time to reduce queue
thrashing and improve performance.
Remove waitflag argument from ng_ksocket_incoming2(), it means nothing
as function call was queued by netgraph.
Remove node validity check, as node validity is guaranteed by netgraph.
Update comments.


# 4ae54e2f 08-Feb-2007 Bruce M Simpson <bms@FreeBSD.org>

In the output path, mask off M_BCAST|M_MCAST so as to prevent incorrect
addressing if a packet is later re-encapsulated and sent to a
non-broadcast, non-multicast destination after being received on the
ng_ksocket input hook.

PR: 106999
Submitted by: Kevin Lahey
MFC after: 4 weeks
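
Masking off the two link-level flags is a single bit operation. The sketch below shows the idea on a stand-in structure. The EX_ flag values are placeholders chosen for illustration and should not be read as the kernel's actual M_BCAST and M_MCAST definitions.

```c
#include <stdint.h>

/* Placeholder flag bits; the real values live in the kernel's mbuf.h. */
#define EX_M_PKTHDR	0x02u
#define EX_M_BCAST	0x10u
#define EX_M_MCAST	0x20u

/* Hypothetical stand-in for the flags word of an mbuf. */
struct ex_mbuf {
	uint32_t m_flags;
};

/* Clear broadcast/multicast marks, leaving every other flag untouched. */
void
ex_clear_link_flags(struct ex_mbuf *m)
{
	m->m_flags &= ~(EX_M_BCAST | EX_M_MCAST);
}
```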


# b0668f71 24-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used universally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman


# aa00bc83 21-Feb-2006 Ruslan Ermilov <ru@FreeBSD.org>

Clear csum_flags after reading data from socket buffer. Otherwise,
if ksocket is connected to an interface-type node somewhere later
in the graph (e.g., ng_eiface or ng_iface), the csum_data may be
applied to a wrong packet (if we encapsulate Ethernet or IP).

MFC after: 3 days


# e71fefbe 06-Sep-2005 Gleb Smirnoff <glebius@FreeBSD.org>

When we read data from socket buffer using soreceive() the socket layer
does not clear m_nextpkt for us. The mbufs are sent into netgraph and
then, if they contain a TCP packet delivered locally, they will enter
socket code again. They can pass the first assert in sbappendstream()
because m_nextpkt may be set not in the first mbuf, but deeper in the
chain. So the problem will trigger much later, when local program
reads the data from socket, and an mbuf with m_nextpkt becomes a
first one.

This bug was unmasked by revision 1.54, when I made upcall queueable.
Before revision 1.54 there was a very small probability to have 2
mbufs in GRE socket buffer, because ng_ksocket_incoming2() dequeued
the first one immediately.

- in ng_ksocket_incoming2() clear m_nextpkt on all mbufs
read from socket.
- restore rev. 1.54 change in ng_ksocket_incoming().

PR: kern/84952
PR: kern/82413
In collaboration with: rwatson
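
The first fix in this entry, clearing m_nextpkt on every mbuf read from the socket, amounts to a walk over the m_next chain. A minimal userland model follows. struct ex_mbuf is a hypothetical two-pointer stand-in for the real mbuf, kept only to show that the stale pointer may sit deeper in the chain than the first element.

```c
#include <stddef.h>

/* Hypothetical stand-in carrying only the two link pointers of an mbuf. */
struct ex_mbuf {
	struct ex_mbuf *m_next;		/* next buffer of the same packet */
	struct ex_mbuf *m_nextpkt;	/* next packet or record in a queue */
};

/* Clear the record link on every buffer in the chain, not just the head. */
void
ex_clear_nextpkt(struct ex_mbuf *m)
{
	for (; m != NULL; m = m->m_next)
		m->m_nextpkt = NULL;
}
```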


# d7f56eab 25-Aug-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Backout revision 1.54, because it exposes a worse problem, than
it fixes. I believe the problem lives somewhere outside ng_ksocket,
but until it is found, let the node be working.

PR: kern/84952
PR: kern/82413
MFC after: 3 days


# f6c9d18d 16-May-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up with new ng_send_fn1() interface.


# 0f4a3524 13-May-2005 Gleb Smirnoff <glebius@FreeBSD.org>

When used as divert socket we need to decouple stack when node is entered
from socket side. Use ng_queue_fn() instead of ng_send_fn().


# bc90ff47 18-Apr-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Fix panics with misconfigured routing:
- Backout previous revision, the check is useless.
- Turn node to queue mode, since it is edge node.

Reported by: sem


# f1c6a420 19-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Reimplement recursion protection, checking whether current thread holds
sockbuf mutex.

Reviewed by: rwatson


# 848a25c7 16-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Remove a recursion protection, which we inherited from splnet() netgraph times.
Now several threads may write data to ng_ksocket. Locking of socket is done in
sosend().

Reviewed by: archie, julian, rwatson
MFC after: 2 weeks


# d96bd8d1 12-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate enough space for new tag.

Pointy hat to: glebius


# b07785ef 12-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

When netgraph(4) was converted to use mbuf_tags(9) instead of meta-data
a definite setup was broken: two ng_ksockets are connected to each other,
connect()ed to different remote hosts, and bind()ed to different local
interfaces. In this case one ng_ksocket is fooled with tag from the other
one.

Put node id into tag. In rcvdata method utilize tag only if it has our
own id inside or id equals zero. The latter case is added to support
packets send by some third, not ng_ksocket node.

MFC after: 1 week
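
The acceptance rule from the second paragraph (use the tag only if it carries our own node id, or id zero) reduces to a small predicate. The structure and function names below are hypothetical and show only the comparison, not the real tag layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag payload: id of the node that attached the tag. */
struct ex_ksock_tag {
	uint32_t id;	/* 0 means the tag came from a non-ksocket node */
};

/* A node honours a tag if it is its own or carries the wildcard id 0. */
bool
ex_tag_usable(const struct ex_ksock_tag *tag, uint32_t my_id)
{
	return (tag->id == my_id || tag->id == 0);
}
```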


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 42ec1da4 02-Sep-2004 Robert Watson <rwatson@FreeBSD.org>

In FreeBSD 5.x, curthread is always defined, so we don't need to test
and optionally use &thread0 if it's NULL.

Spotted by: julian


# 327b288e 25-Jun-2004 Julian Elischer <julian@FreeBSD.org>

Convert Netgraph to use mbuf tags to pass its meta information around.
Thanks to Sam for importing tags in a way that allowed this to be done.

Submitted by: Gleb Smirnoff <glebius@cell.sick.ru>
Also allow the sr and ar drivers to create netgraph versions of their modules.
Document the change to the ksocket node.


# 9535efc0 17-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Merge additional socket buffer locking from rwatson_netperf:

- Lock down low hanging fruit use of sb_flags with socket buffer
lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coalesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.


# c0b99ffa 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coalesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 395a08c9 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 2658b3bb 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivalents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previous implementation, it was sufficient
to simply test once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.


# 36568179 31-May-2004 Robert Watson <rwatson@FreeBSD.org>

The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.


# f8aae777 28-May-2004 Julian Elischer <julian@FreeBSD.org>

Switch to using C99 sparse initialisers for the type methods array.
Should make no binary difference.

Submitted by: Gleb Smirnoff <glebius@cell.sick.ru>
Reviewed by: Harti Brandt <harti@freebsd.org>
MFC after: 1 week
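
For readers unfamiliar with the C99 feature this entry refers to, the sketch below shows a method table filled in by member name. struct ex_type and its members are invented for illustration and do not reproduce the netgraph type structure. Members that are not named are zero-initialized, which is why the change is expected to make no binary difference.

```c
#include <stddef.h>

typedef int ex_method_t(void *);

/* Hypothetical method table, loosely shaped like a node type descriptor. */
struct ex_type {
	const char	*name;
	ex_method_t	*constructor;
	ex_method_t	*rcvmsg;
	ex_method_t	*shutdown;
	ex_method_t	*rcvdata;
};

int
ex_rcvdata(void *arg)
{
	(void)arg;
	return (0);
}

/* Only the members in use are named; the rest default to NULL. */
struct ex_type ex_typestruct = {
	.name = "example",
	.rcvdata = ex_rcvdata,
};
```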


# 87e2c66a 26-Jan-2004 Hartmut Brandt <harti@FreeBSD.org>

Get rid of the deprecated *LEN constants in favour of the new
*SIZ constants that include the trailing \0 byte.


# 7304a833 17-Dec-2003 Ruslan Ermilov <ru@FreeBSD.org>

Replaced two bzero() calls with the M_ZERO flag to malloc().

Reviewed by: julian


# 33583c6f 20-Aug-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Add Protocol Independent Multicast protocol.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


# 7d780740 28-Apr-2003 Archie Cobbs <archie@FreeBSD.org>

Add missing braces.

Submitted by: Andrew Lankford <arlankfo@141.com>


# fcfa0b48 14-Sep-2002 Benno Rice <benno@FreeBSD.org>

Reference the socket we're accepting.


# a7d83226 10-Sep-2002 Benno Rice <benno@FreeBSD.org>

Remember who asked for a connect or accept operation so we can actually tell
them when it's done.

Reviewed by: archie


# facfd889 21-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Don't use "NULL" when "0" is really meant.


# f0184ff8 31-May-2002 Archie Cobbs <archie@FreeBSD.org>

Fix GCC warnings caused by initializing a zero length array. In the process,
simplify things a bit by getting rid of 'struct ng_parse_struct_info' which
was useless because it only contained one field.

MFC after: 2 weeks


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my last commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there
but remove all the code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# dc9c2e01 05-Jan-2002 Archie Cobbs <archie@FreeBSD.org>

Avoid reentrantly sending on the same socket, which causes a kernel panic.


# 9c4d63da 31-Dec-2001 Robert Watson <rwatson@FreeBSD.org>

o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
argument.

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch


# 6e551fb6 10-Dec-2001 David E. O'Brien <obrien@FreeBSD.org>

Update to C99, s/__FUNCTION__/__func__/,
also don't use ANSI string concatenation.


# 19ff9e5f 28-Nov-2001 Archie Cobbs <archie@FreeBSD.org>

When a socket is not connected, allow the peer "struct sockaddr"
to be included in the meta information that is associated with
incoming and outgoing packets.

Reviewed by: julian
MFC after: 1 week


# 129bc895 10-Oct-2001 Archie Cobbs <archie@FreeBSD.org>

Let "raw" mean IPPROTO_RAW instead of IPPROTO_IP.

Noticed by: jdp
MFC after: 3 days


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f97e0a07 07-Sep-2001 Julian Elischer <julian@FreeBSD.org>

First pass at porting John's "accept" changes to
allow an in-kernel webserver (or similar) to accept
and handle incoming connections using netgraph without ever leaving the
kernel. (allows incoming tunnel requests to be
handled totally within the kernel for example)

Needs work, but shouldn't break existing functionality.

Submitted by: John Polstra <jdp@polstra.com>
MFC after: 2 weeks


# 93caaaa7 16-Feb-2001 Archie Cobbs <archie@FreeBSD.org>

Fix an erroneous comment and two style(9) bugs.


# 9c8c302f 10-Jan-2001 Julian Elischer <julian@FreeBSD.org>

Fix some memory leaks
Add memory leak detection assistance.


# 30400f03 07-Jan-2001 Julian Elischer <julian@FreeBSD.org>

Part 2 of the netgraph rewrite.
This is mostly cosmetic changes, (though I caught a bug or two while
making them)
Reviewed by: archie@freebsd.org


# 069154d5 05-Jan-2001 Julian Elischer <julian@FreeBSD.org>

Rewrite of netgraph to start getting ready for SMP.
This version is functional and is approaching solid..
notice I said APPROACHING. There are many node types I cannot test.
I have tested: echo hole ppp socket vjc iface tee bpf async tty
The rest compile and "Look" right. More changes to follow.
DEBUGGING is enabled in this code to help if people have problems.


# 589f6ed8 18-Dec-2000 Julian Elischer <julian@FreeBSD.org>

Divorce the kernel binary ABI version number from the message
format version number. (userland programs should not need to be
recompiled when the netgraph kernel internal ABI is changed.)

Also fix modules that don't handle the fact that a caller may not supply
a return message pointer. (benign at the moment because the calling code
checks, but that will change)


# 859a4d16 12-Dec-2000 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Archie@freebsd.org
This clears out my outstanding netgraph changes.
There is a netgraph change of design in the offing and this is to some
extent a superset of some of the new functionality and some of the old
functionality that may be removed.

This code works as before, but allows some new features that I want to
work with and evaluate. It is the basis for a version of netgraph
with integral locking for SMP use.

This is running on my test machine with no new problems :-)


# 99cdf4cc 18-Nov-2000 David Malone <dwmalone@FreeBSD.org>

Add the use of M_ZERO to netgraph.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
Submitted by: archie
Approved by: archie


# 27121ab1 16-Nov-2000 Brian Somers <brian@FreeBSD.org>

Go back to using data_len in struct ngpppoe_init_data after discussions
with Julian and Archie.

Implement a new ``sizedstring'' parse type for dealing with field pairs
consisting of a uint16_t followed by a data field of that size, and use
this to deal with the data_len and data fields.

Written by: Archie with some input by me
Agreed in principle by: julian


# cc3bbd68 24-Oct-2000 Julian Elischer <julian@FreeBSD.org>

Since neither archie nor I work at Whistle any more, change our email
addresses to be the more useful @freebsd.org ones
so we can keep getting bug-reports.
- man pages to follow..


# be731c30 11-Oct-2000 Archie Cobbs <archie@FreeBSD.org>

Fix memory leak.

Submitted by: Christopher N. Harrell <cnh@ivmg.net>


# 65b9a0da 21-Sep-2000 Archie Cobbs <archie@FreeBSD.org>

Allocate all memory (including within node constructors) with M_NOWAIT
instead of M_WAITOK, to allow for maximum flexibility.


# 57b57be3 10-Aug-2000 Archie Cobbs <archie@FreeBSD.org>

Take advantage of the new unsigned and hex integer types.


# 1baeddb8 09-Aug-2000 Archie Cobbs <archie@FreeBSD.org>

In a struct sockaddr, sa->sa_len can be zero if uninitialized.
Make sure that this doesn't cause a problem when parsing.


# a4ec03cf 28-Apr-2000 Julian Elischer <julian@FreeBSD.org>

Two simple changes to the kernel internal API for netgraph modules,
to support future work in flow-control and 'packet reject/replace'
processing modes.

reviewed by: phk, archie


# 647b86df 06-Dec-1999 Julian Elischer <julian@FreeBSD.org>

Remove a bunch of un-needed includes.
Submitted by: phk@freebsd.org


# f8307e12 29-Nov-1999 Archie Cobbs <archie@FreeBSD.org>

Add two new generic control messages, NGM_ASCII2BINARY and
NGM_BINARY2ASCII, which convert control messages to ASCII and back.
This allows control messages to be sent and received in ASCII form
using ngctl(8), which makes ngctl a lot more useful.

This also allows all the type-specific debugging code in libnetgraph
to go away -- instead, we just ask the node itself to do the ASCII
translation for us.

Currently, all generic control messages are supported, as well as
messages associated with the following node types: async, cisco,
ksocket, and ppp.

See /usr/share/examples/netgraph/ngctl for an example of using this.

Also give ngctl(8) the ability to print out incoming data and
control messages at any time. Eventually nghook(8) may be subsumed.

Several other misc. bug fixes.

Reviewed by: julian


# 25792ef3 23-Nov-1999 Archie Cobbs <archie@FreeBSD.org>

Change the prototype of the strto* routines to make the second
parameter a char ** instead of a const char **. This makes these
kernel routines consistent with the corresponding libc userland
routines.

Which is actually 'correct' is debatable, but consistency and
following the spec was deemed more important in this case.

Reviewed by (in concept): phk, bde


# 19bff684 18-Nov-1999 Archie Cobbs <archie@FreeBSD.org>

Add some safety using KASSERT() and splnet().


# cb3c7a5d 16-Nov-1999 Archie Cobbs <archie@FreeBSD.org>

New netgraph node type "ksocket".

Obtained from: Whistle source tree