History log of /freebsd-current/sys/sys/socketvar.h
Revision	Date	Author	Comments
# 81b4d1c4	08-Apr-2024	Stephen J. Kiernan <stevek@FreeBSD.org>	sockets: Add hhook in sonewconn for inheriting OSD specific data Added HHOOK_SOCKET_NEWCONN and bumped HHOOK_SOCKET_LAST Reviewed by: glebius, tuexen Obtained from: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D44632
# ce69e373	03-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	Revert "sockets: retire sorflush()" Provide a comment in sorflush() why the socket I/O sx(9) lock is actually important. This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.
# f79a8585	30-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: garbage collect SS_ISCONFIRMING Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e
# 507f87a7	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: retire sorflush() With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols that use generic socket buffers. The socket I/O sx(9) lock acquisition in sorflush() is not relevant for shutdown(2) operation as it doesn't do any I/O that may interleave with read(2) or write(2). The socket buffer mutex acquisition inside sbrelease() is what guarantees thread safety. This sx(9) acquisition in soshutdown() can be tracked down to 4.4BSD times, where it used to be sblock(), and it was carried over through the years evolving together with sockets with no reconsideration of why do we carry it over. I can't tell if that sblock() made sense back then, but it doesn't make any today. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43415
# c3276e02	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make shutdown(2) how argument an enum Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43412
# 0fac350c	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on getpeername/getsockname Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simpler syscalls that return socket addresses. Although these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size. Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
# cfb1e929	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on accept(2) Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may provide smaller structure. While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting required length in case if provided length was insufficient. Our manual page accept(2) and POSIX don't explicitly require that, but one can read the text as they do. Linux also does that. Update tests accordingly. Reviewed by: rscheff, tuexen, zlei, dchagin Differential Revision: https://reviews.freebsd.org/D42635
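Note on cfb1e929: the in/out behaviour of 'addrlen' can be observed from userland. The following is a minimal sketch, not code from the commit. It assumes a loopback TCP listener is available, and the helper name accepted_addrlen() is hypothetical. It accepts one connection into a buffer of the requested size and returns the length that accept(2) wrote back.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: accept one loopback TCP connection, offering the
 * kernel only `buflen` bytes for the peer address.  Returns the value
 * accept(2) stored in addrlen, or -1 on any error.
 */
static int
accepted_addrlen(socklen_t buflen)
{
	struct sockaddr_in sin;
	struct sockaddr_storage ss;
	socklen_t len;
	int ls = -1, cs = -1, as, ret = -1;

	if (buflen > sizeof(ss))
		return (-1);
	ls = socket(AF_INET, SOCK_STREAM, 0);
	cs = socket(AF_INET, SOCK_STREAM, 0);
	if (ls < 0 || cs < 0)
		goto out;

	/* Bind the listener to an ephemeral port on 127.0.0.1. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(ls, (struct sockaddr *)&sin, sizeof(sin)) != 0 ||
	    listen(ls, 1) != 0)
		goto out;

	/* Learn the chosen port and connect to it. */
	len = sizeof(sin);
	if (getsockname(ls, (struct sockaddr *)&sin, &len) != 0)
		goto out;
	if (connect(cs, (struct sockaddr *)&sin, sizeof(sin)) != 0)
		goto out;

	/* addrlen is in/out: buffer size on entry, address length on return. */
	len = buflen;
	as = accept(ls, (struct sockaddr *)&ss, &len);
	if (as >= 0) {
		ret = (int)len;
		close(as);
	}
out:
	if (ls >= 0)
		close(ls);
	if (cs >= 0)
		close(cs);
	return (ret);
}
```

With a full-sized buffer the helper should report sizeof(struct sockaddr_in). With a deliberately small buffer, a kernel that includes this change is expected to report the required length rather than the truncated one. That second expectation is read off the commit message, not a tested guarantee.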
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 95ee2897	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: two-line .h pattern Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
# 7a2c93b8	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: provide sousrsend() that does socket specific error handling Sockets have special handling for EPIPE on a write, that was spread out into several places. Treating transient errors is also special - if protocol is atomic, then we should ignore any changes to uio_resid, a transient error means the write had completely failed (see d2b3a0ed31e). - Provide sousrsend() that expects a valid uio, and leave sosend() for kernel consumers only. Do all special error handling right here. - In dofilewrite() don't do special handling of error for DTYPE_SOCKET. - For send(2), write(2) and aio_write(2) call into sousrsend() and remove error handling for kern_sendit(), soo_write() and soaio_process_job(). PR: 265087 Reported by: rz-rpi03 at h-ka.de Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35863
# 3be2225f	10-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Remove fflag argument from getsock_cap Interested callers can obtain it on their own easily enough and there is no reason to branch on it.
# 3212ad15	07-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Add getsock All but one consumers of getsock_cap only pass 4 arguments. Take advantage of it.
# e80062a2	08-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After 6f3caa6d815 it definitely became necessary always and commit message from f1ee30ccd60 confirms that. Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488
# 07285bb4	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: utilize new solisten_clone() and solisten_enqueue() This streamlines cloning of a socket from a listener. Now we do not drop the inpcb lock during creation of a new socket, do not do useless state transitions, and put a fully initialized socket+inpcb+tcpcb into the listen queue. Before this change, first we would allocate the socket and inpcb+tcpcb via tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED, insert into inpcb hash, finalize initialization of tcpcb. And then, in call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call soisconnected(). This call would lock the listening socket once again with a LOR protection sequence and then we would relocate the socket onto the complete queue and only now it is ready for accept(2). Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36064
# 8f5a0a2e	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: provide solisten_clone(), solisten_enqueue() as alternative KPI to sonewconn(). The latter has three stages: - check the listening socket queue limits - allocate a new socket - call into protocol attach method - link the new socket into the listen queue of the listening socket The attach method, originally designed for a creation of socket by the socket(2) syscall has slightly different semantics than attach of a socket cloned by listener. Make it possible for protocols to call into the first stage, then perform a different attach, and then call into the final stage. The first stage, that checks limits and clones a socket is called solisten_clone(), and the function that enqueues the socket is solisten_enqueue(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D36063
# d8596171	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use only soref()/sorele() as socket reference count o Retire SS_FDREF as it is basically a debug flag on top of already existing soref()/sorele(). o Convert SS_PROTOREF into soref()/sorele(). o Change reference model for the listen queues, see below. o Make sofree() private. The correct KPI to use is only sorele(). o Make soabort() respect the model and sorele() instead of sofree(). Note on listening queues. Until now the sockets on a queue had zero reference count. And the reference were given only upon accept(2). The assumption was that there is no way to see the queued socket from anywhere except its head. This is not true, since queued sockets already have pcbs, which are linked at least into the global pcb lists. With this change we put the reference right in the sonewconn() and on accept(2) path we just hand the existing reference to the file descriptor. Differential revision: https://reviews.freebsd.org/D35679
# bc760564	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use positive flag for file descriptor socket reference Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations. Mark sockets created by socreate() with SS_FDREF. This change is mostly illustrative. With it we see that SS_FDREF is a debugging flag, since: * socreate() takes a reference with soref(). * on accept path solisten_dequeue() takes a reference with soref() and then soaccept() sets SS_FDREF. * soclose() checks SS_FDREF, removes it and does sorele(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D35678
# f6379f7f	16-Jun-2022	Mark Johnston <markj@FreeBSD.org>	socket: Fix a race between kevent(2) and listen(2) When locking the knote list for a socket, we check whether the socket is a listening socket in order to select the appropriate mutex; a listening socket uses the socket lock, while data sockets use socket buffer mutexes. If SOLISTENING(so) is false and the knote lock routine locks a socket buffer, then it must re-check whether the socket is a listening socket since solisten_proto() could have changed the socket's identity while we were blocked on the socket buffer lock. Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35492
# 8c0d1eca	30-May-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	sockbuf: retain backward compatibility with userland after d59bc188d652 Add spare fields to xsockbuf in place of sb_mcnt / sb_ccnt to avoid rebuilding userland binaries like sockstat(1). Reviewed by: glebius
# d59bc188	27-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockbuf: remove unused mbuf counter and cluster counter With M_EXTPG mbufs these two counters already do not represent the reality. As we are moving towards protocol independent socket buffers, which may not even use mbufs at all, the counters become less and less relevant. The only userland seeing them was 'netstat -x'. PR: 264181 (exp-run) Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35334
# 43283184	12-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use socket buffer mutexes in struct socket directly Since c67f3b8b78e the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In 74a68313b50 macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152
# a982ce04	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: remove the socket-on-stack hack from sorflush() The hack can be tracked down to 4.4BSD, where copy was performed under splimp() and then after splx() dom_dispose was called. Stevens has a chapter on this function, but he doesn't answer why this trick is necessary. Why can't we call into dom_dispose under splimp()? Anyway, with multithreaded kernel the hack seems to be necessary to avoid LORs between socket buffer lock and different filesystem locks, especially network file systems. The new socket buffers KPI sbcut() from 1d2df300e9b allow us to get rid of the hack. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35125
# 97f8198e	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make SO_SND/SO_RCV an enum Not a functional change now. The enum will also be used for other socket buffer related KPIs.
# e3ba94d4	09-Nov-2021	John Baldwin <jhb@FreeBSD.org>	Don't require the socket lock for sorele(). Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741
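Note on e3ba94d4: the fast path described above (drop the reference without the lock unless it is the last one, otherwise lock and take the slow path) can be modelled in userland. The following is a sketch with C11 atomics and a pthread mutex, not the kernel implementation. struct obj, release_if_not_last(), obj_release_locked() and obj_release() are hypothetical stand-ins for the socket, refcount_release_if_not_last(), sorele_locked() and sorele().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted, lockable object. */
struct obj {
	atomic_uint	refs;
	pthread_mutex_t	lock;
	bool		freed;		/* stand-in for real teardown */
};

/* Drop one reference unless it is the last one; true if it was dropped. */
static bool
release_if_not_last(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	while (old > 1) {
		/* On failure `old` is reloaded and the loop re-evaluates it. */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
	}
	return (false);
}

/* Slow path: called with the lock held, returns with it released. */
static void
obj_release_locked(struct obj *o)
{
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->freed = true;
	pthread_mutex_unlock(&o->lock);
}

/* Fast path first; take the lock only when this may be the last reference. */
static void
obj_release(struct obj *o)
{
	if (release_if_not_last(&o->refs))
		return;
	pthread_mutex_lock(&o->lock);
	obj_release_locked(o);
}
```

A caller that already holds the lock would call obj_release_locked() directly, which mirrors the split the commit message describes for uipc_socket.c.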
# fa0463c3	14-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: De-duplicate SBLOCKWAIT() definitions MFC after: 1 week Sponsored by: The FreeBSD Foundation
# 74a68313	10-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Add macros to lock socket buffers using socket references Since commit c67f3b8b78e50c6df7c057d6cf108e4d6b4312d0 the sockbuf mutexes belong to the containing socket. Sockbufs contain a pointer to a mutex, which by default is initialized to the corresponding mutexes in the socket. The SOCKBUF_LOCK() etc. macros operate on this pointer. However, the pointer is clobbered by listen(2) so it's not safe to use them unless one is sure that the socket is not a listening socket. This change introduces a new set of macros which lock socket buffers through the socket. This is a bit cheaper since it removes the pointer indirection, and allows one to safely lock socket buffers and then check for a listening socket. For MFC, these macros should be reimplemented in terms of the existing socket buffer layout. Reviewed by: tuexen, gallatin, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31900
# bd4a39cc	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659
# c67f3b8b	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Move sockbuf mutexes into the owning socket This is necessary to provide proper interlocking with listen(2), which destroys the socket buffers. Otherwise, code must lock the socket itself and check SOLISTENING(so), but most I/O paths do not otherwise need to acquire the socket lock, so the extra overhead needed to check a rare error case is undesirable. listen(2) calls are relatively rare. Thus, the strategy is to have it acquire all socket buffer locks when transitioning to a listening socket. To do this safely, these locks must be stable, and not destroyed during listen(2) as they are today. So, move them out of the sockbuf and into the owning socket. For the sockbuf mutexes, keep a pointer to the mutex in the sockbuf itself, for now. This can be removed by replacing SOCKBUF_LOCK() etc. with macros which operate on the socket itself, as was done for the sockbuf I/O locks. Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31658
# f94acf52	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writers from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND\|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657
# 97cf43eb	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Reorder socket and sockbuf fields to eliminate some padding This is in preparation for moving sockbuf locks into the owning socket, in order to provide proper interlocking for listen(2). In particular, listening sockets do not use the socket buffers and repurpose that space in struct socket for their own purposes. Moving the locks out of the socket buffers and into the socket proper makes it possible to safely lock socket buffers and test for a listening socket before deciding how to proceed. Reordering these fields saves some space and helps ensure that UMA will provide the same space efficiency for sockets as before. No functional change intended. Reviewed by: tuexen, gallatin Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31656
# 7045b160	28-Jul-2021	Roy Marples <roy@marples.name>	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652
# a1002174	14-Jun-2021	Mark Johnston <markj@FreeBSD.org>	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# f187d6df	15-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.
# 74ae3f3e	14-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. 
There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)
# 924d1c9a	08-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertently. This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.
# 4a01b854	07-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.
# 3c0e5685	23-Jul-2020	John Baldwin <jhb@FreeBSD.org>	Add support for KTLS RX via software decryption. Allow TLS records to be decrypted in the kernel after being received by a NIC. At a high level this is somewhat similar to software KTLS for the transmit path except in reverse. Protocols enqueue mbufs containing encrypted TLS records (or portions of records) into the tail of a socket buffer and the KTLS layer decrypts those records before returning them to userland applications. However, there is an important difference: - In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted. - In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg(). Initially I tried to make this work by marking incoming mbufs as M_NOTREADY, but there didn't seem to be a non-gross way to deal with picking a portion of the mbuf chain and turning it into a new record in the socket buffer after decrypting the TLS record it contained (along with prepending a control message). Also, such mbufs would also need to be "pinned" in some way while they are being decrypted such that a concurrent sbcut() wouldn't free them out from under the thread performing decryption. As such, I settled on the following solution: - Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read). 
- Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads). - The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain. (Side note: the SBCHECK() infrastructure was very useful as I was able to add assertions there about the TLS chain that caught several bugs during development.) Tested by: rmacklem (various versions) Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24628
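Note on 3c0e5685: the "peek at the TLS record header" step can be sketched over a flat byte buffer. This is an illustration only; the kernel works on mbuf chains, and the names tls_record_size() and tls_record_ready() are hypothetical. It relies on the standard TLS record layout: a 5-byte header (1-byte type, 2-byte version, 2-byte big-endian payload length).

```c
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN	5	/* type (1) + version (2) + length (2) */

/*
 * Total on-the-wire size (header plus payload) of the record at the head
 * of `buf`, or 0 when fewer than TLS_HDR_LEN bytes are queued.
 */
static size_t
tls_record_size(const uint8_t *buf, size_t queued)
{
	if (queued < TLS_HDR_LEN)
		return (0);
	/* Bytes 3 and 4 carry the payload length in network byte order. */
	return (TLS_HDR_LEN + (((size_t)buf[3] << 8) | buf[4]));
}

/* Nonzero once a complete record is queued and could go to a worker. */
static int
tls_record_ready(const uint8_t *buf, size_t queued)
{
	size_t need = tls_record_size(buf, queued);

	return (need != 0 && queued >= need);
}
```

Only the first five bytes are ever read, so the check can run as soon as the header has arrived and be repeated as more data is appended.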
# cb7c78fd	21-May-2020	Mark Johnston <markj@FreeBSD.org>	Fix ACCEPT_FILTER_DEFINE to pass the version to MODULE_VERSION. MFC with: r361263
# 591b09b4	19-May-2020	Mark Johnston <markj@FreeBSD.org>	Define a module version for accept filter modules. Otherwise accept filters compiled into the kernel do not preempt preloaded accept filter modules. Then, the preloaded file registers its accept filter module before the kernel, and the kernel's attempt fails since duplicate accept filter list entries are not permitted. This causes the preloaded file's module to be released, since module_register_init() does a lookup by name, so the preloaded file is unloaded, and the accept filter's callback points to random memory since preload_delete_name() unmaps the file on x86 as of r336505. Add a new ACCEPT_FILTER_DEFINE macro which wraps the accept filter and module definitions, and ensures that a module version is defined. PR: 245870 Reported by: Thomas von Dein <freebsd@daemon.de> MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 0326eec5	17-Apr-2020	Simon J. Gerraty <sjg@FreeBSD.org>	Define enum for so_qstate outside of struct. LLVM-9.0 clang++ throws an error for enum defined within an anonymous struct. Reviewed by: jtl, rpokala MFC after: 1 week Differential Revision: https://reviews.freebsd.org//D24477
# fb401f1b	14-Apr-2020	Jonathan T. Looney <jtl@FreeBSD.org>	Make sonewconn() overflow messages have per-socket rate-limits and values. sonewconn() emits debug-level messages when a listen socket's queue overflows. Currently, sonewconn() tracks overflows on a global basis. It will only log one message every 60 seconds, regardless of how many sockets experience overflows. And, when it next logs at the end of the 60 seconds, it records a single message referencing a single PCB with the total number of overflows across all sockets. This commit changes to per-socket overflow tracking. The code will now log one message every 60 seconds per socket. And, the code will provide per-socket queue length and overflow counts. It also provides a way to change the period between log messages using a sysctl. Reviewed by: jhb (previous version), bcr (manpages) MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24316
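Note on fb401f1b: the per-socket scheme amounts to keeping the overflow counter and the time of the last message in each listener, not in globals. The following is a minimal userland sketch, not the sonewconn() code. struct listener_stats and overflow_should_log() are hypothetical names, and time is passed in as plain seconds so the logic stays deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-listener state; one instance per listening socket. */
struct listener_stats {
	int64_t		last_log;	/* time of the last message, seconds */
	bool		logged_once;	/* false until the first message */
	uint32_t	overflows;	/* overflows since the last message */
};

/*
 * Record one queue overflow at time `now`.  Returns the number of
 * overflows to report if a message is due, or 0 if this listener is
 * still inside its quiet period of `period` seconds.
 */
static uint32_t
overflow_should_log(struct listener_stats *ls, int64_t now, int64_t period)
{
	uint32_t n;

	ls->overflows++;
	if (ls->logged_once && now - ls->last_log < period)
		return (0);
	n = ls->overflows;
	ls->overflows = 0;
	ls->last_log = now;
	ls->logged_once = true;
	return (n);
}
```

Because the state lives in the listener, a busy socket cannot suppress messages for a quiet one, and each message carries only that socket's count. The `period` argument plays the role of the sysctl-tunable interval mentioned in the commit message.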
# 715eb7e7	08-Mar-2019	Bjoern A. Zeeb <bz@FreeBSD.org>	Try to improve comment for socket state bits. In r324227 the comment moved into socketvar.h originally from sockstate.h r180948. Try to improve English and as a consequence rewrap the comment. No functional changes. Reviewed by: jhb (a wording suggestion) Differential Revision: https://reviews.freebsd.org/D13865
# 938e8dcf	06-Nov-2018	Brooks Davis <brooks@FreeBSD.org>	Regen after r340199: Use declared types for caddr_t arguments. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852
# cc796319	03-Aug-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Now that after r335979 the kernel addresses in API structures are fixed size, there is no reason left for the unions. Discussed with: brooks
# f38b68ae	05-Jul-2018	Brooks Davis <brooks@FreeBSD.org>	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique identifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386
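Note on f38b68ae: the effect of replacing size_t and pointer members with fixed 64-bit types can be shown with a toy structure. This is a sketch only. struct xexample and its fields are hypothetical and are not the layout of struct xinpcb; the two typedefs are local definitions that follow the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local definitions following the commit message: both are uint64_t. */
typedef uint64_t kvaddr_t;	/* kernel address, used only as an identifier */
typedef uint64_t ksize_t;	/* kernel size_t exported to userland */

/* Hypothetical exported structure; not a real FreeBSD type. */
struct xexample {
	ksize_t		xe_len;		/* was size_t: 4 or 8 bytes */
	kvaddr_t	xe_addr;	/* was a pointer: 4 or 8 bytes */
	uint32_t	xe_flags;
	uint32_t	xe_spare;
};

/* The layout no longer depends on the word size of the consumer. */
static_assert(sizeof(struct xexample) == 24, "fixed-size layout");
static_assert(offsetof(struct xexample, xe_flags) == 16, "fixed offsets");
```

With size_t and pointer members, a 32-bit binary would have seen a 16-byte structure and a 64-bit binary a 24-byte one. With fixed-width members both see the same 24 bytes.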
# 1fbe13cf	08-Jun-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add a socket destructor callback. This allows kernel providers to set callbacks to perform additional cleanup actions at the time a socket is closed. Michio Honda presented a use for this at BSDCan 2018. (See https://www.bsdcan.org/2018/schedule/events/965.en.html .) Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version) Reviewed by: lstewart (previous version) Differential Revision: https://reviews.freebsd.org/D15706
# 1a43cff9	06-Jun-2018	Sean Bruno <sbruno@FreeBSD.org>	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
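Note on 1a43cff9: the load-balancing idea (hash the incoming flow, then index into the group of sockets bound to the port) can be sketched as follows. This is not the in-kernel code. struct lb_group, lb_group_add(), flow_hash() and lb_group_pick() are hypothetical, and the FNV-1a hash is an arbitrary illustrative choice. Only the 256-member limit is taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define LB_GROUP_MAX	256	/* group size limit from the commit message */

/* Hypothetical group; the ints stand in for the pcbs sharing a port. */
struct lb_group {
	size_t	count;
	int	members[LB_GROUP_MAX];
};

/* Add a member; fails with -1 once the group is full. */
static int
lb_group_add(struct lb_group *g, int member)
{
	if (g->count >= LB_GROUP_MAX)
		return (-1);
	g->members[g->count++] = member;
	return (0);
}

/* FNV-1a over the flow identifiers (illustrative hash only). */
static uint32_t
flow_hash(uint32_t faddr, uint16_t fport, uint16_t lport)
{
	uint8_t key[8];
	uint32_t h = 2166136261u;
	size_t i;

	key[0] = (uint8_t)(faddr >> 24);
	key[1] = (uint8_t)(faddr >> 16);
	key[2] = (uint8_t)(faddr >> 8);
	key[3] = (uint8_t)faddr;
	key[4] = (uint8_t)(fport >> 8);
	key[5] = (uint8_t)fport;
	key[6] = (uint8_t)(lport >> 8);
	key[7] = (uint8_t)lport;
	for (i = 0; i < sizeof(key); i++) {
		h ^= key[i];
		h *= 16777619u;
	}
	return (h);
}

/* Pick the member for a flow, or -1 if the group is empty. */
static int
lb_group_pick(const struct lb_group *g, uint32_t faddr, uint16_t fport,
    uint16_t lport)
{
	if (g->count == 0)
		return (-1);
	return (g->members[flow_hash(faddr, fport, lport) % g->count]);
}
```

The same flow always maps to the same member while the group is unchanged. Adding or removing a member changes the modulus, so flows may be redistributed.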
# 7875017c	24-Apr-2018	Sean Bruno <sbruno@FreeBSD.org>	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks
# 7b7796ee	23-Apr-2018	Sean Bruno <sbruno@FreeBSD.org>	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple programs or threads to bind to the same port; incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and cannot be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
# 78e3e2a4	07-Dec-2017	Bjoern A. Zeeb <bz@FreeBSD.org>	Use correct field in the description for the lock after r319722. Reviewed by: glebius Sponsored by: iXsystems, Inc.
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 0e229f34	02-Oct-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Hide struct socket and struct unpcb from the userland. Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and are not guaranteed stability of the structures. The violators list is the usual one: libprocstat(3) and netstat(1) internally and lsof in ports. In struct xunpcb remove the inclusion of kernel structure and add a bunch of spare fields. The xsocket already has socket not included, but add there spares as well. Embed xsockbuf into xsocket. Sort declarations in sys/socketvar.h to separate kernel only from userland available ones. PR: 221820 (exp-run)
# 779f106a	08-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. 
To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parallelism in the UNIX sockets. - Now a call into uipc_attach() may happen with unp_link_lock held if we are accepting, or without unp_link_lock if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basically did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770
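The move of soref()/sorele() to atomic(9) in the commit above can be illustrated with a userland sketch. This is not the kernel code: it uses C11 `<stdatomic.h>` in place of atomic(9), and `struct sock_model` and the `model_*` functions are hypothetical names. It only shows why taking a reference no longer needs the socket lock, while the final release still has to trigger teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference count kept in struct socket. */
struct sock_model {
	atomic_uint refs;	/* number of outstanding references */
	bool freed;		/* set once the last reference is dropped */
};

/* Start with one reference, held by the creator. */
static inline void
model_soinit(struct sock_model *so)
{
	atomic_init(&so->refs, 1);
	so->freed = false;
}

/* Take a reference; no lock is needed because the increment is atomic. */
static inline void
model_soref(struct sock_model *so)
{
	atomic_fetch_add_explicit(&so->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns true if this call released the last one. */
static inline bool
model_sorele(struct sock_model *so)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(&so->refs, 1, memory_order_acq_rel);
	assert(old > 0);	/* releasing more than was taken is a bug */
	if (old == 1) {
		so->freed = true;	/* the kernel would tear the socket down here */
		return (true);
	}
	return (false);
}
```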
# b3244df7	06-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Provide typedef for socket upcall function. While here change so_gen_t type to modern uint64_t.
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4. Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# f3e7afe2	18-Jan-2017	Hans Petter Selasky <hselasky@FreeBSD.org>	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 
3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months
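Step 1 of the rate limiting description above (the application sets a rate with setsockopt()) might look like the following from userland. The helper name `set_pacing_rate` is hypothetical, and the sketch assumes the rate is passed as a 32-bit value in bytes per second, consistent with the stated range of 1 to 4 Gbytes/s; whether the rate is enforced in hardware depends on the kernel configuration and the network driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: request a maximum pacing rate, in bytes per
 * second, for a connected socket.  Returns 0 on success, -1 with
 * errno set on failure.
 */
static inline int
set_pacing_rate(int s, uint32_t bytes_per_sec)
{
#ifdef SO_MAX_PACING_RATE
	return (setsockopt(s, SOL_SOCKET, SO_MAX_PACING_RATE,
	    &bytes_per_sec, sizeof(bytes_per_sec)));
#else
	/* The option is not provided by this system's headers. */
	(void)s;
	(void)bytes_per_sec;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke this right after accept(2) or connect(2), for example `set_pacing_rate(fd, 1250000)` for roughly 10 Mbit/s.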
# 339efd75	16-Jan-2017	Maxim Sobolev <sobomax@FreeBSD.org>	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides unified interface to get bintime (equivalent of using SO_BINTIME), except it is also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to deprecate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implements a totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171
# 85b0f9de	22-Sep-2016	Mariusz Zaborski <oshogbo@FreeBSD.org>	capsicum: propagate rights on accept(2). The descriptor returned by accept(2) should inherit capability rights from the listening socket. PR: 201052 Reviewed by: emaste, jonathan Discussed with: many Differential Revision: https://reviews.freebsd.org/D7724
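A hedged sketch of how an application might select the monotonic clock source described in the SO_TS_CLOCK commit above. The helper name `enable_monotonic_timestamps` is hypothetical; the sketch assumes the constant SO_TS_MONOTONIC names the monotonic source and falls back to ENOPROTOOPT where the headers do not define it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable receive timestamps and ask for the
 * nanosecond resolution monotonic clock as their source.
 * Returns 0 on success, -1 with errno set on failure.
 */
static inline int
enable_monotonic_timestamps(int s)
{
#if defined(SO_TS_CLOCK) && defined(SO_TS_MONOTONIC)
	int on = 1;
	int clock = SO_TS_MONOTONIC;

	/* SO_TS_CLOCK only selects the source; SO_TIMESTAMP turns stamps on. */
	if (setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on)) == -1)
		return (-1);
	return (setsockopt(s, SOL_SOCKET, SO_TS_CLOCK, &clock, sizeof(clock)));
#else
	/* The option is not provided by this system's headers. */
	(void)s;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```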
# f22bfc72	23-Jun-2016	Navdeep Parhar <np@FreeBSD.org>	Add spares to struct ifnet and socket for packet pacing and/or general use. Update comments regarding the spare fields in struct inpcb. Bump __FreeBSD_version for the changes to the size of the structures. Reviewed by: gnn@ Approved by: re@ (gjb@) Sponsored by: Chelsio Communications
# e9877429	18-May-2016	Gleb Smirnoff <glebius@FreeBSD.org>	The SA-16:19 wouldn't have happened if the sockargs() had properly typed argument for length. While here make it static and convert to ANSI C. Reviewed by: C Turt
# 5163d2ec	29-Apr-2016	John Baldwin <jhb@FreeBSD.org>	Expose soaio_enqueue(). This can be used by protocol-specific AIO handlers to queue work to the socket AIO daemon pool. Sponsored by: Chelsio Communications
# f3215338	01-Mar-2016	John Baldwin <jhb@FreeBSD.org>	Refactor the AIO subsystem to permit file-type-specific handling and improve cancellation robustness. Introduce a new file operation, fo_aio_queue, which is responsible for queueing and completing an asynchronous I/O request for a given file. The AIO subsystem now exports a library of routines to manipulate AIO requests as well as the ability to run a handler function in the "default" pool of AIO daemons to service a request. A default implementation for file types which do not include an fo_aio_queue method queues requests to the "default" pool invoking the fo_read or fo_write methods as before. The AIO subsystem permits file types to install a private "cancel" routine when a request is queued to permit safe dequeueing and cleanup of cancelled requests. Sockets now use their own pool of AIO daemons and service per-socket requests in FIFO order. Socket requests will not block indefinitely permitting timely cancellation of all requests. Due to the now-tight coupling of the AIO subsystem with file types, the AIO subsystem is now a standard part of all kernels. The VFS_AIO kernel option and aio.ko module are gone. Many file types may block indefinitely in their fo_read or fo_write callbacks resulting in a hung AIO daemon. This can result in hung user processes (when processes attempt to cancel all outstanding requests during exit) or a hung system. To protect against this, AIO requests are only permitted for known "safe" files by default. AIO requests for all file types can be enabled by setting the new vfs.aio.enable_unsafe sysctl to a non-zero value. The AIO tests have been updated to skip operations on unsafe file types if the sysctl is zero. Currently, AIO requests on sockets and raw disks are considered safe and are enabled by default. aio_mlock() is also enabled by default. 
Reviewed by: cem, jilles Discussed with: kib (earlier version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5289
# 1662c2ae	16-Feb-2016	John Baldwin <jhb@FreeBSD.org>	The locking annotations for struct sockbuf originally used the key from struct socket. When sockbuf.h was moved out of socketvar.h, the locking key was no longer nearby. Instead, add a new key for sockbuf and use a single item for the socket buffer lock instead of separate entries for receive vs send buffers. Reviewed by: adrian Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D4901
# 5652770d	05-Feb-2016	John Baldwin <jhb@FreeBSD.org>	Rename aiocblist to kaiocb and use consistent variable names. Typically <foo>list is used for a structure that holds a list head in FreeBSD, not for members of a list. As such, rename 'struct aiocblist' to 'struct kaiocb' (the kernel version of 'struct aiocb'). While here, use more consistent variable names for AIO control blocks: - Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job objects. - Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE(). - Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs. - Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold a user pointer to a 'struct aiocb'. - Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'. Reviewed by: kib Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5125
# 7325dfbb	01-Feb-2016	Alfred Perlstein <alfred@FreeBSD.org>	Increase max allowed backlog for listen sockets from short to int. PR: 203922 Submitted by: White Knight <white_knight@2ch.net> MFC After: 4 weeks
# e370f90a	16-Aug-2015	Xin LI <delphij@FreeBSD.org>	so_vnet is constant after creation and no locking is necessary, document this fact. (netmap has an assignment too but that socket object is on the stack). MFC after: 2 weeks
# 25742185	11-Apr-2015	Mateusz Guzik <mjg@FreeBSD.org>	Replace struct filedesc argument in getsock_cap with struct thread. This is a step towards removal of spurious arguments.
# cfa6009e	12-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	In preparation of merging projects/sendfile, transform bare access to sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.
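The accessor pattern introduced by the sbavail()/sbused() commit above can be sketched as follows. This is a simplified model, not the kernel definition: `struct sockbuf_model` is a hypothetical stand-in that keeps only a byte counter, and both accessors return the same value, matching the commit's statement that they are equal for now.

```c
#include <assert.h>

/* Hypothetical, heavily simplified stand-in for struct sockbuf. */
struct sockbuf_model {
	unsigned int sb_cc;	/* bytes currently held in the buffer */
};

/* Bytes a reader could consume right now. */
static inline unsigned int
model_sbavail(const struct sockbuf_model *sb)
{
	return (sb->sb_cc);
}

/* Bytes occupying buffer space, whether or not they are readable yet. */
static inline unsigned int
model_sbused(const struct sockbuf_model *sb)
{
	return (sb->sb_cc);
}
```

Routing every read of the counter through these two functions means callers already state which meaning they want, so the two can later diverge without touching call sites.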
# 80b47aef	09-Oct-2014	Marcel Moolenaar <marcel@FreeBSD.org>	Move the SCTP syscalls to netinet with the rest of the SCTP code. The syscalls themselves are tightly coupled with the network stack and therefore should not be in the generic socket code. The following four syscalls have been marked as NOSTD so they can be dynamically registered in sctp_syscalls_init() function: sys_sctp_peeloff sys_sctp_generic_sendmsg sys_sctp_generic_sendmsg_iov sys_sctp_generic_recvmsg The syscalls are also set up to be dynamically registered when COMPAT32 option is configured. As a side effect of moving the SCTP syscalls, getsock_cap needs to be made available outside of the uipc_syscalls.c source file. A proper prototype has been added to the sys/socketvar.h header file. API tests from the SCTP reference implementation have been run to ensure compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout) Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: tuexen, rrs Obtained from: Juniper Networks, Inc.
# 4ec73712	18-Aug-2014	Marcel Moolenaar <marcel@FreeBSD.org>	For vendors like Juniper, extensibility for sockets is important. A good example is socket options that aren't necessarily generic. To this end, OSD is added to the socket structure and hooks are defined for key operations on sockets. These are: o soalloc() and sodealloc() o Get and set socket options o Socket related kevent filters. One aspect about hhook that appears to be not fully baked is the return semantics (the return value from the hook is ignored in hhook_run_hooks() at the time of commit). To support return values, the socket_hhook_data structure contains a 'status' field to hold return values. Submitted by: Anuranjan Shukla <anshukla@juniper.net> Obtained from: Juniper Networks, Inc.
# 3846a822	16-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
# 237abf0c	28-Jun-2013	Davide Italiano <davide@FreeBSD.org>	- Trim an unused and bogus Makefile for mount_smbfs. - Reconnect with some minor modifications, in particular now selsocket() internals are adapted to use sbintime units after recent'ish calloutng switch.
# f89d4c3a	06-May-2013	Andre Oppermann <andre@FreeBSD.org>	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
# 18ba072a	10-Apr-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Fix build.
# 7493f24e	02-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen); which allow binding and connecting, respectively, to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des
# 8713f68a	08-Dec-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	The socket_zone UMA zone is now private to uipc_socket.c.
# 2e564269	17-Oct-2012	Attilio Rao <attilio@FreeBSD.org>	Disconnect non-MPSAFE SMBFS from the build in preparation for dropping GIANT from VFS. In addition, disconnect also netsmb, which is a base requirement for SMBFS. In the meantime SMBFS regular users can use FUSE interface and smbnetfs port to work with their SMBFS partitions. Also, there are ongoing efforts by vendor to support in-kernel smbfs, so there are good chances that it will get relinked once properly locked. This is not targeted for MFC.
# 5c9d0a9a	12-Nov-2010	Luigi Rizzo <luigi@FreeBSD.org>	This commit implements the SO_USER_COOKIE socket option, which lets you tag a socket with a uint32_t value. The cookie can then be used by the kernel for various purposes, e.g. setting the skipto rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has been implemented; however there is nothing ipfw-specific in its implementation). The ipfw-related code that uses the option will be committed separately. This change adds a field to 'struct socket', but the struct is not part of any driver or userland-visible ABI so the change should be harmless. See the discussion at http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html Idea and code from Paul Joe, small modifications and manpage changes by myself. Submitted by: Paul Joe MFC after: 1 week
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# adb6aa9a	18-Sep-2010	Robert Watson <rwatson@FreeBSD.org>	With reworking of the socket life cycle in 7.x, the need for a "sotryfree()" was eliminated: all references to sockets are explicitly managed by sorele() and the protocols. As such, garbage collect sotryfree(), and update sofree() comments to make the new world order more clear. MFC after: 3 days Reported by: Anuranjan Shukla <anshukla at juniper dot net>
# 1a996ed1	18-Jul-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Revert r210225 - turns out I was wrong; the "/*-" is not license-only thing; it's also used to indicate that the comment should not be automatically rewrapped. Explained by: cperciva@
# 805cc58a	18-Jul-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright occurrences from sys/sys/ and sys/kern/.
# 7f5dff50	07-Jul-2009	Konstantin Belousov <kib@FreeBSD.org>	Fix poll(2) and select(2) for named pipes to return "ready for read" when all writers, observed by reader, exited. Use writer generation counter for fifo, and store the snapshot of the fifo generation in the f_seqcount field of struct file, that is otherwise unused for fifos. Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is equal to the fifo's fi_wgen, and revert r89376. Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them. Note that the patch does not fix not returning POLLHUP for fifos. PR: kern/94772 Submitted by: bde (original version) Reviewed by: rwatson, jilles Approved by: re (kensmith) MFC after: 6 weeks (might be)
# ef760e6a	22-Jun-2009	Andre Oppermann <andre@FreeBSD.org>	Add soreceive_stream(), an optimized version of soreceive() for stream (TCP) sockets. It is functionally identical to generic soreceive() but has a number of stream-specific optimizations: o does only one sockbuf unlock/lock per receive independent of the length of data to be moved into the uio compared to soreceive() which unlocks/locks per mbuf. o uses m_mbuftouio() instead of its own copy(out) variant. o much more compact code flow as a large number of special cases is removed. o much improved readability. It offers significantly reduced CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done. This function was written by "reverse engineering" and is not just a stripped down variant of soreceive(). It is not yet enabled by default on TCP sockets. Instead it is commented out in the protocol initialization in tcp_usrreq.c until more widespread testing has been done. Testers, especially with 10GigE gear, are welcome. MFP4: r164817 //depot/user/andre/soreceive_stream/
# 74fb0ba7	01-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Rework socket upcalls to close some races with setup/teardown of upcalls. - Each socket upcall is now invoked with the appropriate socket buffer locked. It is not permissible to call soisconnected() with this lock held, however, so socket upcalls now return an integer value. The two possible values are SU_OK and SU_ISCONNECTED. If an upcall returns SU_ISCONNECTED, then the soisconnected() will be invoked on the socket after the socket buffer lock is dropped. - A new API is provided for setting and clearing socket upcalls. The API consists of soupcall_set() and soupcall_clear(). - To simplify locking, each socket buffer now has a separate upcall. - When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from the receive socket buffer automatically. Note that a SO_SND upcall should never return SU_ISCONNECTED. - All this means that accept filters should now return SU_ISCONNECTED instead of calling soisconnected() directly. They also no longer need to explicitly clear the upcall on the new socket. - The HTTP accept filter still uses soupcall_set() to manage its internal state machine, but other accept filters no longer have any explicit knowledge of socket upcall internals aside from their return value. - The various RPC client upcalls currently drop the socket buffer lock while invoking soreceive() as a temporary band-aid. The plan for the future is to add a new flag to allow soreceive() to be called with the socket buffer locked. - The AIO callback for socket I/O is now also invoked with the socket buffer locked. Previously sowakeup() would drop the socket buffer lock only to call aio_swake() which immediately re-acquired the socket buffer lock for the duration of the function call. Discussed with: rwatson, rmacklem
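The upcall return protocol described in the commit above can be modelled in userland as follows. Everything here is a hypothetical sketch: `struct so_model`, the `model_*` functions, and the numeric values chosen for SU_OK and SU_ISCONNECTED are illustrative assumptions, and a boolean stands in for the socket buffer lock. The point is the ordering: the upcall runs with the buffer "locked", and soisconnected() is deferred until after the lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values; the names come from the commit message. */
#define	SU_OK		0
#define	SU_ISCONNECTED	1

struct so_model;
typedef int upcall_fn(struct so_model *so, void *arg);

/* Hypothetical stand-in for a socket with one receive-side upcall. */
struct so_model {
	bool		 buf_locked;	/* models the socket buffer lock */
	upcall_fn	*upcall;
	void		*upcall_arg;
	bool		 connected;
};

static inline void
model_soisconnected(struct so_model *so)
{
	assert(!so->buf_locked);	/* not permitted with the lock held */
	so->connected = true;
}

/* Run the upcall under the lock, then act on its return value. */
static inline void
model_sowakeup(struct so_model *so)
{
	int ret = SU_OK;

	so->buf_locked = true;
	if (so->upcall != NULL) {
		ret = so->upcall(so, so->upcall_arg);
		if (ret == SU_ISCONNECTED) {
			/* The upcall is cleared automatically. */
			so->upcall = NULL;
			so->upcall_arg = NULL;
		}
	}
	so->buf_locked = false;
	if (ret == SU_ISCONNECTED)
		model_soisconnected(so);
}

/* Accept-filter style upcall: report readiness instead of calling out. */
static inline int
model_filter(struct so_model *so, void *arg)
{
	const int *ready = arg;

	(void)so;
	return (*ready ? SU_ISCONNECTED : SU_OK);
}
```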
# f6dfe47a	30-Apr-2009	Marko Zec <zec@FreeBSD.org>	Permit building kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. 
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 849cca9b	31-Jul-2008	Kip Macy <kmacy@FreeBSD.org>	move sockbuf locking macros in to sockbuf.h
# 66a4ba62	29-Jul-2008	Kip Macy <kmacy@FreeBSD.org>	Factor sockbuf, sockopt, and sockstate out of socketvar.h in to separate headers. Reviewed by: rwatson MFC after: 3 days
# 5df3e839	02-Jul-2008	Robert Watson <rwatson@FreeBSD.org>	Add soreceive_dgram(9), an optimized socket receive function for use by datagram-only protocols, such as UDP. This version removes use of sblock(), which is not required due to an inability to interlace data improperly with datagrams, as well as avoiding some of the larger loops and state management that don't apply on datagram sockets. This is experimental code, so hook it up only for UDPv4 for testing; if there are problems we may need to revise it or turn it off by default, but it offers significant performance improvements for threaded UDP applications such as BIND9, nsd, and memcached using UDP. Tested by: kris, ps
# 49f287f8	15-May-2008	George V. Neville-Neil <gnn@FreeBSD.org>	Update the kernel to count the number of mbufs and clusters (all types) used per socket buffer. Add support to netstat to print out all of the socket buffer statistics. Update the netstat manual page to describe the new -x flag which gives the extended output. Reviewed by: rwatson, julian
# 8b07e49a	09-May-2008	Julian Elischer <julian@FreeBSD.org>	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). 
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. Thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. Messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. 
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods thus reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
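The row/column lookup that commit 8b07e49a describes can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the kernel code: every sketch_* name, the tiny family count and the value chosen for the IPv4 index are hypothetical, and only the 16-table limit and the rt_tables[Y][X] layout come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAXFIBS	16	/* the commit mentions a 16-table limit */
#define SKETCH_AF_MAX	4	/* hypothetical: a tiny number of families */
#define SKETCH_AF_INET	2	/* hypothetical index standing in for IPv4 */

/* Stand-in for a per-family FIB head structure. */
struct sketch_rt_head {
	int id;
};

/* Rows are FIB numbers (Y), columns are protocol families (X). */
static struct sketch_rt_head *sketch_rt_tables[SKETCH_MAXFIBS][SKETCH_AF_MAX];

/* Legacy-style entry point: unaware of FIBs, always uses row 0. */
static struct sketch_rt_head *
sketch_table_legacy(int af)
{
	if (af < 0 || af >= SKETCH_AF_MAX)
		return (NULL);
	return (sketch_rt_tables[0][af]);
}

/*
 * FIB-aware entry point: only IPv4 honours the requested row;
 * every other family is forced back to row 0.
 */
static struct sketch_rt_head *
sketch_table_fib(int af, int fib)
{
	if (af < 0 || af >= SKETCH_AF_MAX)
		return (NULL);
	if (af != SKETCH_AF_INET || fib < 0 || fib >= SKETCH_MAXFIBS)
		fib = 0;
	return (sketch_rt_tables[fib][af]);
}
```

Code that only ever calls the legacy-style function sees exactly what the old one-dimensional array offered, which is the ABI-compatibility argument made in the commit message.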
# 3f0bfccc	03-Feb-2008	Robert Watson <rwatson@FreeBSD.org>	Further clean up sorflush: - Expose sbrelease_internal(), a variant of sbrelease() with no expectations about the validity of locks in the socket buffer. - Use sbrelease_internal() in sorflush(), and as a result avoid initializing and destroying a socket buffer lock for the temporary stack copy of the actual buffer, asb. - Add a comment indicating why we do what we do, and remove an XXX since things have gotten less ugly in sorflush() lately. This makes socket close cleaner, and possibly also marginally faster. MFC after: 3 weeks
# 265de5bb	31-Jan-2008	Robert Watson <rwatson@FreeBSD.org>	Correct two problems relating to sorflush(), which is called to flush read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushing it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>
# 5e0f5cfa	17-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	Add SB_NOCOALESCE flag to disable socket buffer update in place
# ace8398d	15-Dec-2007	Jeff Roberson <jeff@FreeBSD.org>	Refactor select to reduce contention and hide internal implementation details from consumers. - Track individual selecters on a per-descriptor basis such that there are no longer collisions and after sleeping for events only those descriptors which triggered events must be rescanned. - Protect the selinfo (per descriptor) structure with a mtx pool mutex. mtx pool mutexes were chosen to preserve api compatibility with existing code which does nothing but bzero() to setup selinfo structures. - Use a per-thread wait channel rather than a global wait channel. - Hide select implementation details in a seltd structure which is opaque to the rest of the kernel. - Provide a 'selsocket' interface for those kernel consumers who wish to select on a socket when they have no fd so they no longer have to be aware of select implementation details. Tested by: kris Reviewed on: arch
# 7abab911	03-May-2007	Robert Watson <rwatson@FreeBSD.org>	sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovered by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.
# ddca17a6	19-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Space to tab in SB_* defines to match with rest of file.
# 4e023759	19-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Maintain a pointer and offset pair into the socket buffer mbuf chain to avoid traversal of the entire socket buffer for larger offsets on stream sockets. Adjust tcp_output() make use of it. Tested by: gallatin
# 6a37f331	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Generic socket buffer auto sizing support, header defines, flag inheritance. MFC after: 1 month
# eaa6dfbc	01-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Reimplement socket buffer tear-down in sofree(): as the socket is no longer referenced by other threads (hence our freeing it), we don't need to set the can't send and can't receive flags, wake up the consumers, perform two levels of locking, etc. Implement a fast-path teardown, sbdestroy(), which flushes and releases each socket buffer. A manual dom_dispose of the receive buffer is still required explicitly to GC any in-flight file descriptors, etc, before flushing the buffer. This results in a 9% UP performance improvement and 16% SMP performance improvement on a tight loop of socket();close(); in micro-benchmarking, but will likely also affect CPU-bound macro-benchmark performance.
# b0668f71	24-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used universally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit). This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls. Architectural head nod: sam, gnn, wollman
# f9f4beac	23-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Tweak so_gencnt comment: it was once last, but that is no longer the case.
# 03b8ff0b	23-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Tweak comment for so_head: it is a pointer to the listen socket, rather than the accept socket.
# cd3a3a26	17-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have basically always violated) invariants of soreceive(), which assume that the first mbuf pointer in a receive socket buffer can't change while the SB_LOCK sleepable lock is held on the socket buffer, which is precisely what these functions do. No current protocols invoke these functions, and removing them will help discourage them from ever being used. I should have removed them years ago, but lost track of it. MFC after: 1 week Prodded almost by accident by: peter
# b37ffd31	10-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocation code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Make soalloc() and sodealloc() static as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month
# c43a4e8a	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Add a comment describing SS_PROTOREF in detail. This will eventually be in socket(9). MFC after: 3 months
# 92c07a34	16-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	Change soabort() from returning int to returning void, since all consumers ignore the return value, soabort() is required to succeed, and protocols produce errors here to report multiple freeing of the pcb, which we hope to eliminate.
# cf4f9f6d	15-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	Correct spelling of 0x4000 in previous commit. This one line change from a 42k patch seemed easier to retype than apply, but apparently not. :-) Submitted by: pjd
# 5d511d26	14-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	Add SS_PROTOREF socket flag, which represents a strong reference by the protocol to the socket. Normally protocol references are weak: that is, the socket layer can tear down the socket (and hence protocol state) when it finds convenient. This flag will allow the protocol to explicitly declare to the socket layer that it is maintaining a strong reference, rather than the current implicit model associated with so_pcb pointer values and repeated attempts to possibly free the socket.
# b8ae1cd6	13-Jan-2006	Robert Watson <rwatson@FreeBSD.org>	Add sosend_dgram(), a greatly reduced and simplified version of sosend() intended for use solely with atomic datagram socket types, and relies on the previous break-out of sosend_copyin(). Changes to allow UDP to optionally use this instead of sosend() will be committed as a follow-up.
# 34333b16	02-Nov-2005	Andre Oppermann <andre@FreeBSD.org>	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005
# d374e81e	30-Oct-2005	Robert Watson <rwatson@FreeBSD.org>	Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
# 542d484d	08-Jul-2005	John Baldwin <jhb@FreeBSD.org>	Document that SOCK_LOCK is used to protect so_emuldata. Approved by: re (scottl)
# a59f81d2	11-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it now matches do_setopt_accept_filter(). Slightly reformulate the logic to match the optimistic allocation of storage for the argument in advance, and slightly expand the coverage of the socket lock.
# 0daccb9c	21-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely required elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
# 78e43644	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where the rest of the accept filter code currently lives. MFC after: 3 days
# 41ee6cfc	30-Jan-2005	Gleb Smirnoff <glebius@FreeBSD.org>	Move sb_state to the beginning of structure, above sb_startzero member. sb_state shouldn't be erased, when socket buffer is flushed by sorflush(). When sb_state was bzero'ed, a recently set SBS_CANTRCVMORE flag was cleared. If a socket was shutdown(SHUT_RD), a subsequent read() would block on it. Reported by: Ed Maste, Gerrit Nagelhout Reviewed by: rwatson
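A minimal sketch of the layout trick behind commit 41ee6cfc, assuming a flush that zeroes everything from a "start zero" member to the end of the structure. The struct, its fields beyond sb_state, and the function name are hypothetical stand-ins, and the flag value is only illustrative; the point is that a member placed above the zeroed region survives the flush.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SBS_CANTRCVMORE	0x0020	/* illustrative flag value */

/* Hypothetical, heavily reduced socket buffer. */
struct sketch_sockbuf {
	short		sb_state;	/* kept ABOVE the zeroed region */
	unsigned int	sb_cc;		/* first member of the zeroed region */
	unsigned int	sb_hiwat;
	short		sb_flags;
};

/* Zero everything from sb_cc to the end of the structure. */
static void
sketch_sb_clear(struct sketch_sockbuf *sb)
{
	size_t off = offsetof(struct sketch_sockbuf, sb_cc);

	memset((char *)sb + off, 0, sizeof(*sb) - off);
}
```

Had sb_state been declared after sb_cc in this sketch, the same memset() would have wiped a freshly set "can't receive more" flag, which is the failure mode the commit message reports.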
# 90d52f2f	23-Jan-2005	Gleb Smirnoff <glebius@FreeBSD.org>	- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from short to unsigned short. - Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX. Before this change setting somaxconn to something above 32767 and calling listen(fd, -1) led to a socket, which doesn't accept connections at all. Reviewed by: rwatson Reported by: Igor Sysoev
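The range problem behind commit 90d52f2f can be illustrated with a small sketch. It assumes that an out-of-range backlog is replaced by the somaxconn limit before being stored in the queue-limit field, which is how the commit message reads; the sketch_* helpers are hypothetical and are not the kernel's code.

```c
#include <assert.h>
#include <limits.h>

/* Assumed clamping rule: a negative or too-large backlog becomes the limit. */
static int
sketch_clamp_backlog(int backlog, int somaxconn)
{
	if (backlog < 0 || backlog > somaxconn)
		backlog = somaxconn;
	return (backlog);
}

/* Would the value survive being stored in a (signed) short field? */
static int
sketch_fits_short(int v)
{
	return (v >= SHRT_MIN && v <= SHRT_MAX);
}

/* Would the value survive being stored in an unsigned short field? */
static int
sketch_fits_ushort(int v)
{
	return (v >= 0 && v <= USHRT_MAX);
}

/* Validation in the spirit of the new sysctl handler: reject < 1 or too big. */
static int
sketch_somaxconn_valid(int v)
{
	return (v >= 1 && v <= USHRT_MAX);
}
```

With a limit of 40000, listen(fd, -1) yields a backlog of 40000 under this rule, which does not fit a signed short but does fit an unsigned short; validating the limit itself keeps it within the field's range.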
# 81158452	18-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
# b10eb615	09-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Add SOCKBUF_UNLOCK_ASSERT(), which asserts that the current thread does not hold the mutex for a socket buffer.
# dcee93dc	12-Jul-2004	David Malone <dwmalone@FreeBSD.org>	Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a better name. I have a kern_[sg]etsockopt which I plan to commit shortly, but the arguments to these function will be quite different from so_setsockopt. Approved by: alfred
# d58d3648	12-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts. Tune the timeout from 5 seconds to 12 seconds. Provide a sysctl to show how many reconnects the NFS client has done. Seems to fix IPv6 from: kuriyama
# 91e45cce	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate so_gencnt field of struct socket locked by so_global_mtx.
# 63a9224f	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate so_error as being used for simple assignment and reads, and therefore not locked. Assert the socket buffer lock in sowwakeup_locked() to match sorwakeup_locked().
# d60454e3	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate which SB_ constants are for sb_flags fields.
# 927c5cea	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Protect so_oobmark with SOCKBUF_LOCK(&so->so_rcv), and broaden locking in tcp_input() for TCP packets with urgent data pointers to hold the socket buffer lock across testing and updating oobmark from just protecting sb_state. Update socket locking annotations
# 3f11a2f3	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Introduce sbreserve_locked(), which asserts the socket buffer lock on the socket buffer having its limits adjusted. sbreserve() now acquires the lock before calling sbreserve_locked(). In soreserve(), acquire socket buffer locks across read-modify-writes of socket buffer fields, and calls into sbreserve/sbrelease; make sure to acquire in keeping with the socket buffer lock order. In tcp_mss(), acquire the socket buffer lock in the calling context so that we have atomic read-modify-write on buffer sizes.
# a34b7046	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Merge next step in socket buffer locking: - sowakeup() now asserts the socket buffer lock on entry. Move the call to KNOTE higher in sowakeup() so that it is made with the socket buffer lock held for consistency with other calls. Release the socket buffer lock prior to calling into pgsigio(), so_upcall(), or aio_swake(). Locking for this event management will need revisiting in the future, but this model avoids lock order reversals when upcalls into other subsystems result in socket/socket buffer operations. Assert that the socket buffer lock is not held at the end of the function. - Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now have _locked versions which assert the socket buffer lock on entry. If a wakeup is required by sb_notify(), invoke sowakeup(); otherwise, unconditionally release the socket buffer lock. This results in the socket buffer lock being released whether a wakeup is required or not. - Break out socantsendmore() into socantsendmore_locked() that asserts the socket buffer lock. socantsendmore() unconditionally locks the socket buffer before calling socantsendmore_locked(). Note that both functions return with the socket buffer unlocked as socantsendmore_locked() calls sowwakeup_locked() which has the same properties. Assert that the socket buffer is unlocked on return. - Break out socantrcvmore() into socantrcvmore_locked() that asserts the socket buffer lock. socantrcvmore() unconditionally locks the socket buffer before calling socantrcvmore_locked(). Note that both functions return with the socket buffer unlocked as socantrcvmore_locked() calls sorwakeup_locked() which has similar properties. Assert that the socket buffer is unlocked on return. - Break out sbrelease() into a sbrelease_locked() that asserts the socket buffer lock. sbrelease() unconditionally locks the socket buffer before calling sbrelease_locked(). 
sbrelease_locked() now invokes sbflush_locked() instead of sbflush(). - Assert the socket buffer lock in socket buffer sanity check functions sblastrecordchk(), sblastmbufchk(). - Assert the socket buffer lock in SBLINKRECORD(). - Break out various sbappend() functions into sbappend_locked() (and variations on that name) that assert the socket buffer lock. The !_locked() variations unconditionally lock the socket buffer before calling their _locked counterparts. Internally, make sure to call _locked() support routines, etc, if already holding the socket buffer lock. - Break out sbinsertoob() into sbinsertoob_locked() that asserts the socket buffer lock. sbinsertoob() unconditionally locks the socket buffer before calling sbinsertoob_locked(). - Break out sbflush() into sbflush_locked() that asserts the socket buffer lock. sbflush() unconditionally locks the socket buffer before calling sbflush_locked(). Update panic strings for new function names. - Break out sbdrop() into sbdrop_locked() that asserts the socket buffer lock. sbdrop() unconditionally locks the socket buffer before calling sbdrop_locked(). - Break out sbdroprecord() into sbdroprecord_locked() that asserts the socket buffer lock. sbdroprecord() unconditionally locks the socket buffer before calling sbdroprecord_locked(). - sofree() now calls socantsendmore_locked() and re-acquires the socket buffer lock on return. It also now calls sbrelease_locked(). - sorflush() now calls socantrcvmore_locked() and re-acquires the socket buffer lock on return. Clean up/mess up other behavior in sorflush() relating to the temporary stack copy of the socket buffer used with dom_dispose by more properly initializing the temporary copy, and selectively bzeroing/copying more carefully to prevent WITNESS from getting confused by improperly initialized mutexes. Annotate why that's necessary, or at least, needed. - soisconnected() now calls sbdrop_locked() before unlocking the socket buffer to avoid locking overhead. 
Some parts of this change were: Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
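The locked/unlocked split that commit a34b7046 applies to sbflush(), sbdrop(), sbrelease() and others follows one pattern, sketched below. A plain integer flag stands in for the socket buffer mutex so the example stays self-contained; that toy lock, the struct and all sketch_* names are assumptions made for illustration and are not the kernel's primitives.

```c
#include <assert.h>

/* Hypothetical socket buffer with a toy lock in place of a real mutex. */
struct sketch_sb {
	int locked;	/* 1 while the "lock" is held */
	int cc;		/* bytes currently queued */
};

static void
sketch_sb_lock(struct sketch_sb *sb)
{
	assert(!sb->locked);	/* toy lock: no recursion */
	sb->locked = 1;
}

static void
sketch_sb_unlock(struct sketch_sb *sb)
{
	assert(sb->locked);
	sb->locked = 0;
}

/* _locked variant: asserts that the caller already holds the lock. */
static void
sketch_sbflush_locked(struct sketch_sb *sb)
{
	assert(sb->locked);
	sb->cc = 0;
}

/* Plain variant: takes the lock, calls the _locked counterpart, releases. */
static void
sketch_sbflush(struct sketch_sb *sb)
{
	sketch_sb_lock(sb);
	sketch_sbflush_locked(sb);
	sketch_sb_unlock(sb);
}
```

A caller that already holds the lock uses the _locked variant directly and avoids a second acquisition, which is the reason the commit gives for soisconnected() calling sbdrop_locked().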
# 80bd7213	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate so_state as locked with SOCK_LOCK(so). Add a comment indicating that the SB_ constants apply to sb_flags.
# ead2ceda	15-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Fill in locking annotation for additional socket fields: so_timeo Used as a sleep/wakeup address, no locking. sb_* Almost all socket buffer fields locked with sockbuf lock for the socket buffer. so_cred Static after socket creation.
# 90c306cc	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Remove unneeded '-' from comment header; this comment contains only English text paragraphs that shouldn't have problems when run through indent.
# c0b99ffa	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coalesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 310e7ceb	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
# e656d9a6	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Move #ifdef _KERNEL higher in socketvar.h to cover various socket buffer related macros.
# 395a08c9	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# 25928771	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Whitespace-only restyling of socket reference count macros.
# f6c0cce6	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Introduce a mutex into struct sockbuf, sb_mtx, which will be used to protect fields in the socket buffer. Add accessor macros to use the mutex (SOCKBUF_()). Initialize the mutex in soalloc(), and destroy it in sodealloc(). In addition, add SOCK_() access macros which will protect most remaining fields in the socket; for the time being, use the receive socket buffer mutex to implement socket level locking to reduce memory overhead. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# 0d017b27	11-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Use tabs instead of spaces between #define and macro name; a merge mistake as they are in rwatson_netperf.
# e7dd9a10	03-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Mark sun_noname as const since it's immutable. Update definitions of functions that potentially accept &sun_noname (sbappendaddr(), et al) to accept a const sockaddr pointer.
# 2658b3bb	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivalents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previous implementation, it was sufficient to simply test once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one. 
- In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.
# 302b4501	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Replace current locking comments for struct socket/struct sockbuf with new ones. Annotate constant-after-creation fields as such. The comments describe a number of locks that are not yet merged.
# 36568179	31-May-2004	Robert Watson <rwatson@FreeBSD.org>	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different from the locking on the remainder of the flags in so_state.
# 82c6e879	06-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 746e5bf0	29-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Rename dup_sockaddr() to sodupsockaddr() for consistency with other functions in kern_socket.c. Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT in from the caller context rather than "1" or "0". Correct mflags pass into mac_init_socket() from previous commit to not include M_ZERO. Submitted by: sam
# 2bc87dcf	29-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Modify soalloc() API so that it accepts a malloc flags argument rather than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT rather than 0 or 1. This simplifies the soalloc() logic, and also makes the waiting behavior of soalloc() more clear in the calling context. Submitted by: sam
# e45db9b8	15-Nov-2003	Alan Cox <alc@FreeBSD.org>	- Modify alpha's sf_buf implementation to use the direct virtual-to- physical mapping. - Move the sf_buf API to its own header file; make struct sf_buf's definition machine dependent. In this commit, we remove an unnecessary field from struct sf_buf on the alpha, amd64, and ia64. Ultimately, we may eliminate struct sf_buf on those architectures except as an opaque pointer that references a vm page.
# eca8a663	11-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Modify the MAC Framework so that instead of embedding a (struct label) in various kernel objects to represent security data, we embed a (struct label *) pointer, which now references labels allocated using a UMA zone (mac_label.c). This allows the size and shape of struct label to be varied without changing the size and shape of these kernel objects, which become part of the frozen ABI with 5-STABLE. This opens the door for boot-time selection of the number of label slots, and hence changes to the bound on the number of simultaneous labeled policies at boot-time instead of compile-time. This also makes it easier to embed label references in new objects as required for locking/caching with fine-grained network stack locking, such as inpcb structures. This change also moves us further in the direction of hiding the structure of kernel objects from MAC policy modules, not to mention dramatically reducing the number of '&' symbols appearing in both the MAC Framework and MAC policy modules, and improving readability. While this results in minimal performance change with MAC enabled, it will observably shrink the size of a number of critical kernel data structures for the !MAC case, and should have a small (but measurable) performance benefit (i.e., struct vnode, struct socket) due to memory conservation and reduced cost of zeroing memory. NOTE: Users of MAC must recompile their kernel and all MAC modules as a result of this change. Because this is an API change, third party MAC modules will also need to be updated to make less use of the '&' symbol. Suggestions from: bmilekic Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 395bb186	27-Oct-2003	Sam Leffler <sam@FreeBSD.org>	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
# cc342686	04-Aug-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Make the second argument to sooptcopyout() constant in order to simplify the upcoming PIM patches. Submitted by: Pavlin Radoslavov <pavlin@icir.org>
# 4e19fe10	17-Jul-2003	Robert Drehmel <robert@FreeBSD.org>	To avoid a kernel panic provoked by a NULL pointer dereference, do not clear the `sb_sel' member of the sockbuf structure while invalidating the receive sockbuf in sorflush(), called from soshutdown(). The panic was reproducible from user land by attaching a knote with EVFILT_READ filters to a socket, disabling further reads from it using shutdown(2), and then closing it. knote_remove() was called to remove all knotes from the socket file descriptor by detaching each using its associated filterops' detach callback function, sordetach() in this case, which tried to remove itself from the invalidated sockbuf's klist (sb_sel.si_note). PR: kern/54331
# 9f6d45b1	28-Mar-2003	Alan Cox <alc@FreeBSD.org>	Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it. The ultimate reason for this change is to enable two optimizations: (1) that there never be more than one sf_buf mapping a vm_page at a time and (2) 64-bit architectures can transparently use their 1-1 virtual to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter() and pmap_qremove().
# 521f364b	02-Mar-2003	Dag-Erling Smørgrav <des@FreeBSD.org>	More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# b3f1af6b	11-Jan-2003	Tim J. Robbins <tjr@FreeBSD.org>	Don't count mbufs with m_type == MT_HEADER or MT_OOBDATA as control data in sballoc(), sbcompress(), sbdrop() and sbfree(). Fixes fstat() st_size reporting and kevent() EVFILT_READ on TCP sockets.
# 08c7670a	23-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Move the declaration of the socket fileops from socketvar.h to file.h. This allows us to use the new typedefs and removes the needs for a number of forward struct declarations in socketvar.h
# 6ce9c72c	23-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	s/sokqfilter/soo_kqfilter/ for consistency with the naming of all other socket/file operations.
# 5ee0a409	01-Nov-2002	Alan Cox <alc@FreeBSD.org>	Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing a panic because the socket's state isn't as expected by sofree(). Discussed with: dillon, fenner
# e0f640e8	01-Nov-2002	Kelly Yancey <kbyanc@FreeBSD.org>	Track the number of non-data characters stored in socket buffers so that the data value returned by kevent()'s EVFILT_READ filter on non-TCP sockets accurately reflects the amount of data that can be read from the sockets by applications. PR: 30634 Reviewed by: -net, -arch Sponsored by: NTT Multimedia Communications Labs MFC after: 2 weeks
# d49fa1ca	16-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	In continuation of early fileop credential changes, modify fo_ioctl() to accept an 'active_cred' argument reflecting the credential of the thread initiating the ioctl operation. - Change fo_ioctl() to accept active_cred; change consumers of the fo_ioctl() interface to generally pass active_cred from td->td_ucred. - In fifofs, initialize filetmp.f_cred to ap->a_cred so that the invocations of soo_ioctl() are provided access to the calling f_cred. Pass ap->a_td->td_ucred as the active_cred, but note that this is required because we don't yet distinguish file_cred and active_cred in invoking VOP's. - Update kqueue_ioctl() for its new argument. - Update pipe_ioctl() for its new argument, pass active_cred rather than td_ucred to MAC for authorization. - Update soo_ioctl() for its new argument. - Update vn_ioctl() for its new argument, use active_cred rather than td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR(). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# ea6027a8	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Make similar changes to fo_stat() and fo_poll() as made earlier to fo_read() and fo_write(): explicitly use the cred argument to fo_poll() as "active_cred" using the passed file descriptor's f_cred reference to provide access to the file credential. Add an active_cred argument to fo_stat() so that implementers have access to the active credential as well as the file credential. Generally modify callers of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which was redundantly provided via the fp argument. This set of modifications also permits threads to perform these operations on behalf of another thread without modifying their credential. Trickle this change down into fo_stat/poll() implementations: - badfo_poll(), badfo_stat(): modify/add arguments. - kqueue_poll(), kqueue_stat(): modify arguments. - pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to MAC checks rather than td->td_ucred. - soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather than cred to pru_sopoll() to maintain current semantics. - sopoll(): modify arguments. - vn_poll(), vn_statfile(): modify/add arguments, pass new arguments to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL() to maintain current semantics. - vn_close(): rename cred to file_cred to reflect reality while I'm here. - vn_stat(): Add active_cred and file_cred arguments to vn_stat() and consumers so that this distinction is maintained at the VFS as well as 'struct file' layer. Pass active_cred instead of td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics. - fifofs: modify the creation of a "filetemp" so that the file credential is properly initialized and can be used in the socket code if desired. Pass ap->a_td->td_ucred as the active credential to soo_poll(). If we teach the vnop interface about the distinction between file and active credentials, we would use the active credential here. 
Note that current inconsistent passing of active_cred vs. file_cred to VOP's is maintained. It's not clear why GETATTR would be authorized using active_cred while POLL would be authorized using file_cred at the file system level. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 9ca43589	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to clarify which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged - pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. 
Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 01abbb42	13-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Move to a nested include of _label.h instead of mac.h in sys/sys/*.h (Most of the places where mac.h was recursively included from another kernel header file. net/netinet to follow.) Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs Suggested by: bde
# 9e63574e	13-Aug-2002	David Greenman <dg@FreeBSD.org>	Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h so that they can be seen by external callers.
# 781caa81	30-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Label socket IPC objects, permitting security features to be maintained at the granularity of the socket. Two labels are stored for each socket: the label of the socket itself, and a cached peer label permitting interrogation of the remote endpoint. Since socket locking is not yet present in the base tree, these objects are not locked, but are assumed to follow the same semantics as other modifiable entries in the socket structure. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 4581821b	27-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Remove socheckproc(), which was removed when p_can*() was introduced ages ago. The prototype was missed. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# f824b518	23-Jul-2002	John Polstra <jdp@FreeBSD.org>	Widen struct sockbuf's sb_timeo member to int from short. With non-default but reasonable values of hz this member overflowed, breaking NFS over UDP. Also, as long as I'm plowing up struct sockbuf ... Change certain members from u_long/long to u_int/int in order to reduce wasted space on 64-bit machines. This change was requested by Andrew Gallatin. Netstat and systat need to be rebuilt. I am incrementing __FreeBSD_version in case any ports need to change.
# 7f05b035	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.
# 02a32cd2	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	change struct socket -> so_pcb from caddr_t to void *.
# 98cb733c	25-Jun-2002	Kenneth D. Merry <ken@FreeBSD.org>	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. 
(uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. 
tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c: Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object_allocate() does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structure. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
# 03e49181	18-Jun-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Remove so*_locked(), which were backed out by mistake.
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my last commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each member in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 90535973	02-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Cleanup, quote: This leaves some vestiges of the old locking, including style bugs in it. I've only noticed anachronisms in socketvar.h so far (I've merged net* but not kern or all of sys). The patch also has old fixes for style bugs in accf stuff and namespace pollution in uma... The largest style bugs are line continued backslashes in column 80 and (these are fixed), and starting the do-while code for the new macros in column 40, which is quite unlike the usual indentation (see sys/queue.h) and not even like the indentation for the old macros (column 32) (this is not fixed). Submitted by: bde
# f1320723	01-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# 960ed29c	29-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Revert the change of #includes in sys/filedesc.h and sys/socketvar.h. Requested by: bde Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h. While I am here, sort include files alphabetically, where possible.
# d48d4b25	27-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Add a global sx sigio_lock to protect the pointer to the sigio object of a socket. This avoids lock order reversal caused by locking a process in pgsigio(). sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now require sigio_lock to be locked. Provide sowwakeup_locked(), soisconnected_locked(), and so on in case where we have to modify a socket and wake up a process atomically.
# c473d3e4	23-Apr-2002	Mike Silbersack <silby@FreeBSD.org>	Remove sodropablereq - this function hasn't been used since the syncache went in. MFC after: 3 days
# 20504246	07-Apr-2002	Jeffrey Hsu <hsu@FreeBSD.org>	There's only one socket zone so we don't need to remember it in every socket structure.
# cb331c71	25-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Removed some namespace pollution (unnecessary nested includes).
# c58eb46e	23-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Fixed some style bugs in the removal of __P(()). The main ones were not removing tabs before "__P((", and not outdenting continuation lines to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting and/or rewrap the whole prototype in some cases.
# 54d77689	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Backout part of my previous commit; I was wrong about vm_zone's handling of limits on zones w/o objects.
# c897b813	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 789f12fe	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P
# 8355f576	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator. Reviewed by: arch@
# 59047cca	13-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	Add parens around macro args. Forgotten by: dillon
# 47927c19	08-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	holdsock is gone, remove the prototype
# 9c4d63da	31-Dec-2001	Robert Watson <rwatson@FreeBSD.org>	o Make the credential used by socreate() an explicit argument to socreate(), rather than getting it implicitly from the thread argument. o Make NFS cache the credential provided at mount-time, and use the cached credential (nfsmount->nm_cred) when making calls to socreate() on initially connecting, or reconnecting the socket. This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well as bugs involving NFS and mandatory access control implementations. Reviewed by: freebsd-arch
# 63b42c19	13-Dec-2001	Brian Feldman <green@FreeBSD.org>	Remove stale prototype for sonewconn3().
# b1e4abd2	16-Nov-2001	Matthew Dillon <dillon@FreeBSD.org>	Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
# f8bf16fc	24-Oct-2001	Robert Watson <rwatson@FreeBSD.org>	o Remove extern showallsockets, defunct as of the change to kern.security.seeotheruids_permitted. This was missed in the commit that made this change elsewhere.
# 4787fd37	05-Oct-2001	Paul Saab <ps@FreeBSD.org>	Only allow users to see their own socket connections if kern.ipc.showallsockets is set to 0. Submitted by: billf (with modifications by me) Inspired by: Dave McKay (aka pm aka Packet Magnet) Reviewed by: peter MFC after: 2 weeks
# 2af8d76d	13-Sep-2001	David E. O'Brien <obrien@FreeBSD.org>	Re-apply rev 1.178 -- style(9) the structure definitions. I have to wonder how many other changes were lost in the KSE milestone 2 merge.
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 5752bffd	04-Sep-2001	David E. O'Brien <obrien@FreeBSD.org>	style(9) the structure definitions.
# 255a0181	31-Aug-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Whitespace change.
# 83a1e729	27-Jun-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Correct comment: so_q -> so_comp, so_q0 -> so_incomp. Submitted by: Adagio Vangogh <adagio_v@pacbell.net>
# 608a3ce6	15-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
# 0a2c3d48	08-Jan-2001	Garrett Wollman <wollman@FreeBSD.org>	select() DKI is now in <sys/selinfo.h>.
# 49851cc7	31-Dec-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Use macro API to <sys/queue.h>
# 279d7226	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	This patchset fixes a large number of file descriptor race conditions. Pre-rfork code assumed inherent locking of a process's file descriptor array. However, with the advent of rfork() the file descriptor table could be shared between processes. This patch closes over a dozen serious race conditions related to one thread manipulating the table (e.g. closing or dup()ing a descriptor) while another is blocked in an open(), close(), fcntl(), read(), write(), etc... PR: kern/11629 Discussed with: Alexander Viro <viro@math.psu.edu>
# 34b94e8b	06-Sep-2000	Alfred Perlstein <alfred@FreeBSD.org>	Accept filter maintenance Update copyrights. Introduce a new sysctl node: net.inet.accf Although acceptfilters need refcounting to be properly (safely) unloaded as a temporary hack allow them to be unloaded if the sysctl net.inet.accf.unloadable is set, this is really for developers who want to work on their own filters. A near complete re-write of the accf_http filter: 1) Parse check if the request is HTTP/1.0 or HTTP/1.1 if not dump to the application. Because of the performance implications of this there is a sysctl 'net.inet.accf.http.parsehttpversion' that when set to non-zero parses the HTTP version. The default is to parse the version. 2) Check if a socket has filled and dump to the listener 3) optimize the way that mbuf boundaries are handled using some voodoo 4) even though you'd expect accept filters to only be used on TCP connections that don't use m_nextpkt I've fixed the accept filter for socket connections that use this. This rewrite of accf_http should allow someone to use them and maintain full HTTP compliance as long as net.inet.accf.http.parsehttpversion is set.
# a5c4836d	19-Aug-2000	David Malone <dwmalone@FreeBSD.org>	Replace the mbuf external reference counting code with something that should be better. The old code counted references to mbuf clusters by using the offset of the cluster from the start of memory allocated for mbufs and clusters as an index into an array of chars, which did the reference counting. If the external storage was not a cluster then reference counting had to be done by the code using that external storage. NetBSD's system of linked lists of mbufs was considered, but Alfred felt it would have locking issues when the kernel was made more SMP friendly. The system implemented uses a pool of unions to track external storage. The union contains an int for counting the references and a pointer for forming a free list. The reference counts are incremented and decremented atomically and so should be SMP friendly. This system can track reference counts for any sort of external storage. Access to the reference counting stuff is now through macros defined in mbuf.h, so it should be easier to make changes to the system in the future. The possibility of storing the reference count in one of the referencing mbufs was considered, but was rejected 'cos it would often leave extra mbufs allocated. Storing the reference count in the cluster was also considered, but because the external storage may not be a cluster this isn't an option. The size of the pool of reference counters is available in the stats provided by "netstat -m". PR: 19866 Submitted by: Bosko Milekic <bmilekic@dsuper.net> Reviewed by: alfred (glanced at by others on -net)
# a79b7128	19-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	return of the accept filter part II accept filters are now loadable as well as able to be compiled into the kernel. two accept filters are provided, one that returns sockets when data arrives the other when an http request is completed (doesn't work with 0.9 requests) Reviewed by: jmg
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# cb679c38	16-Apr-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce kqueue() and kevent(), a kernel event notification facility.
# bfbbc4aa	13-Jan-2000	Jason Evans <jasone@FreeBSD.org>	Add aio_waitcomplete(). Make aio work correctly for socket descriptors. Make gratuitous style(9) fixes (me, not the submitter) to make the aio code more readable. PR: kern/12053 Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
# 664a31e4	28-Dec-1999	Peter Wemm <peter@FreeBSD.org>	Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistent with the other BSDs who made this change quite some time ago. More commits to come.
# 82cd038d	21-Nov-1999	Yoshinobu Inoue <shin@FreeBSD.org>	KAME netinet6 basic part (no IPsec, no V6 Multicast Forwarding, no UDP/TCP for IPv6 yet) With this patch, you can assign IPv6 addr automatically, and can reply to IPv6 ping. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 0ba80ba6	07-Nov-1999	Peter Wemm <peter@FreeBSD.org>	Update socket file type for fo_stat(). soo_stat() becomes a fileops switch entry point rather than being used externally with knowledge of the internals of the DTYPE_SOCKET f_data contents.
# ecf72308	09-Oct-1999	Brian Feldman <green@FreeBSD.org>	Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
# 13ccadd4	19-Sep-1999	Brian Feldman <green@FreeBSD.org>	This is what was "fdfix2.patch," a fix for fd sharing. It's pretty far-reaching in fd-land, so you'll want to consult the code for changes. The biggest change is that now, you don't use fp->f_ops->fo_foo(fp, bar) but instead fo_foo(fp, bar), which increments and decrements the fp refcount upon entry and exit. Two new calls, fhold() and fdrop(), are provided. Each does what it seems like it should, and if fdrop() brings the refcount to zero, the fd is freed as well. Thanks to peter ("to hell with it, it looks ok to me.") for his review. Thanks to msmith for keeping me from putting locks everywhere :) Reviewed by: peter
# 2f9a2132	18-Sep-1999	Brian Feldman <green@FreeBSD.org>	Change so_cred's type to a ucred, not a pcred. This makes more sense, actually. Make a sonewconn3() which takes an extra argument (proc) so new sockets created with sonewconn() from a user's system call get the correct credentials, not just the parent's credentials.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# f29be021	17-Jun-1999	Brian Feldman <green@FreeBSD.org>	Reviewed by: the cast of thousands This is the change to struct sockets that gets rid of so_uid and replaces it with a much more useful struct pcred *so_cred. This is here to be able to do socket-level credential checks (i.e. IPFW uid/gid support, to be added to HEAD soon). Along with this comes an update to pidentd which greatly simplifies the code necessary to get a uid from a socket. Soon to come: a sysctl() interface to finding individual sockets' credentials.
# 8fe387ab	04-Apr-1999	Dmitrij Tejblum <dt@FreeBSD.org>	Add standard padding argument to pread and pwrite syscall. That should make them NetBSD compatible. Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that the offset is set in the struct uio). Factor out some common code from read/pread/write/pwrite syscalls.
# 9cbac9ce	01-Feb-1999	Mark Newton <newton@FreeBSD.org>	Moved prototypes for soo_{read,write,close} into socketvar.h where they belong. Suggested by: bde
# 4c253324	31-Jan-1999	Bruce Evans <bde@FreeBSD.org>	Fixed smashed tabs and inconsistent comment style in previous commit.
# f8b03d85	29-Jan-1999	Mark Newton <newton@FreeBSD.org>	Changed struct socket to include a new field (at the end, so as not to break existing software) acting as a pointer to emulator-specific state data that some emulators may (or may not) need to maintain about a socket. Used by the svr4 module as a place for maintaining STREAMS emulation state. Discussed with: Mike Smith, Garrett Wollman back in Sept 98
# 527b7a14	25-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Port NetBSD's 19990120-accept bug fix. This works around the race condition where select(2) can return that a listening socket has a connected socket queued, the connection is broken, and the user calls accept(2), which then blocks because there are no connections queued. Reviewed by: wollman Obtained from: NetBSD (ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)
# 62d6ce3a	11-Nov-1998	Don Lewis <truckman@FreeBSD.org>	I got another batch of suggestions for cosmetic changes from bde.
# 831d27a9	11-Nov-1998	Don Lewis <truckman@FreeBSD.org>	Installed the second patch attached to kern/7899 with some changes suggested by bde, a few other tweaks to get the patch to apply cleanly again and some improvements to the comments. This change closes some fairly minor security holes associated with F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN had on tty devices. For more details, see the description on the PR. Because this patch increases the size of the proc and pgrp structures, it is necessary to re-install the includes and recompile libkvm, the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w. PR: kern/7899 Reviewed by: bde, elvind
# dd0b2081	05-Nov-1998	David Greenman <dg@FreeBSD.org>	Implemented zero-copy TCP/IP extensions via sendfile(2) - send a file to a stream socket. sendfile(2) is similar to implementations in HP-UX, Linux, and other systems, but the API is more extensive and addresses many of the complaints that the Apache Group and others have had with those other implementations. Thanks to Marc Slemko of the Apache Group for helping me work out the best API for this. Anyway, this has the "net" result of speeding up sends of files over TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU cycles) when compared to a traditional read/write loop.
# cfe8b629	22-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# ecbb00a2	07-Jun-1998	Doug Rabson <dfr@FreeBSD.org>	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us in line with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.
# 47147132	31-May-1998	Peter Wemm <peter@FreeBSD.org>	Have the sorwakeup and sowwakeup check the upcall flags. Obtained from: NetBSD
# 98271db4	15-May-1998	Garrett Wollman <wollman@FreeBSD.org>	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for information about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
# 4049a042	01-Mar-1998	Guido van Rooij <guido@FreeBSD.org>	Make sure that you can only bind a more specific address when it is done by the same uid. Obtained from: OpenBSD
# 8bcc577e	01-Feb-1998	Bruce Evans <bde@FreeBSD.org>	Forward declare more structs that are used in prototypes here - don't depend on <sys/types.h> forward declaring common ones.
# cb3453e8	21-Dec-1997	Bruce Evans <bde@FreeBSD.org>	Moved some declarations from <sys/socket.h> to the correct places, and fixed everything that depended on them being misplaced.
# 3a74593f	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Update interfaces for poll()
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 548af278	19-Jul-1997	Bill Fenner <fenner@FreeBSD.org>	Remove sonewconn() macro kludge, introduced in 4.3-Reno to catch argument mismatches. Prototypes do a much better job these days. Noticed by: bde
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code-cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 0701aeed	12-Nov-1996	Bruce Evans <bde@FreeBSD.org>	Added missing prototype for new function sbcreatecontrol(). Should be in 2.2.
# ebb0cbea	06-Oct-1996	Paul Traina <pst@FreeBSD.org>	Increase robustness of FreeBSD against high-rate connection attempt denial of service attacks. Reviewed by: bde,wollman,olah Inspired by: vjs@sgi.com
# 44d5b3c8	30-Apr-1996	Bruce Evans <bde@FreeBSD.org>	Made this self-sufficient (apart from <sys/types.h>) again. It included <sys/stat.h> and <sys/filedesc.h> just to get struct tags and depended on a previous #include for <sys/queue.h>
# 02e2c406	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [new sys/syscallargs.h file, to be "cvs rm"ed]
# be24e9e8	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Changed socket code to use 4.4BSD queue macros. This includes removing the obsolete soqinsque and soqremque functions as well as collapsing so_q0len and so_qlen into a single queue length of unaccepted connections. Now the queue of unaccepted & complete connections is checked directly for queued sockets. The new code should be functionally equivalent to the old while being substantially faster - especially in cases where large numbers of connections are often queued for accept (e.g. http).
# 3b3cc59f	10-Mar-1996	Jeffrey Hsu <hsu@FreeBSD.org>	Merge in Lite2: clean up function prototypes. Did not accept change of second argument to ioctl from int to u_long. Reviewed by: davidg & bde
# dc915e7c	13-Feb-1996	Garrett Wollman <wollman@FreeBSD.org>	Kill XNS. While we're at it, fix socreate() to take a process argument. (This was supposed to get committed days ago...)
# 6c5e9bbd	30-Jan-1996	Mike Pritchard <mpp@FreeBSD.org>	Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
# 47daf5d5	14-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Nuked ambiguous sleep message strings (old -> new): netcls[] = "netcls" -> "soclos"; netcon[] = "netcon" -> "accept", "connec"; netio[] = "netio" -> "sblock", "sbwait"
# 87b6de2b	14-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	A Major staticize sweep. Generates a couple of warnings that I'll deal with later. A number of unused vars removed. A number of unused procs removed or #ifdefed.
# 512fef80	20-Nov-1995	Bruce Evans <bde@FreeBSD.org>	Completed function declarations and/or added prototypes.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# f86eaaca	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Prototypes, prototypes and even more prototypes. Not quite done yet, but getting closer all the time.
# af9da405	20-Aug-1994	Paul Richards <paul@FreeBSD.org>	Made them all idempotent.
# f23b4c91	18-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources