History log of /freebsd-current/sys/kern/uipc_syscalls.c
Revision Date Author Comments
# aa32d7cb 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record socket violations with KTR_CAPFAIL

Report restricted access to socket addresses and protocols while
Capsicum violation tracing with CAPFAIL_ADDR and CAPFAIL_PROTO.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40681


# 0cd9cde7 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record namei violations with KTR_CAPFAIL

Report namei path lookups while Capsicum violation tracing with
CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic
capability mode behavior.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40680


# 47ad4f2d 04-Mar-2024 Kyle Evans <kevans@FreeBSD.org>

ktrace: log genio events on failed write

Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.

When a genio event is potentially raised after an error, currently we'll
just free the uio and return. However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe. Tap out the
entire buffer on failure rather than simply discarding it.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D43799


# f79a8585 30-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: garbage collect SS_ISCONFIRMING

Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e


# c3276e02 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make shutdown(2) how argument a enum

Reviwed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43412


# 0fac350c 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on getpeername/getsockname

Just like it was done for accept(2) in cfb1e92912b4, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.

Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 761ae1ce 16-Oct-2023 Mark Johnston <markj@FreeBSD.org>

ktrace: Handle uio_resid underflow via MSG_TRUNC

When recvmsg(2) is used with MSG_TRUNC on an atomic socket type (DGRAM
or SEQPACKET), soreceive_generic() and uipc_peek_dgram() may
intentionally underflow uio_resid so that userspace can find out how
many bytes it should have asked for.

If this happens, and KTR_GENIO is enabled, ktrgenio() will attempt to
copy in beyond the end of the output buffer's iovec. In general this
will silently cause the ktrace operation to fail since it'll result in
EFAULT from uiomove(). Let's be more careful and make sure not to try
and copy more bytes than we have.

Fixes: be1f485d7d6b ("sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().")
Reported by: syzbot+30b4bb0c0bc0f53ac198@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42099


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 6016aedb 12-Jun-2023 Dmitriy Alexandrov <d06alexandrov@users.noreply.github.com>

uipc_syscalls: removed unnecessary check in accept1() function

Signed-off-by: Dmitriy Alexandrov <d06alexandrov@gmail.com>
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/773


# 00343b4a 13-Feb-2023 Mateusz Guzik <mjg@FreeBSD.org>

uipc: ansify

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 7a2c93b8 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: provide sousrsend() that does socket specific error handling

Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31e).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().

PR: 265087
Reported by: rz-rpi03 at h-ka.de
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35863


# 1760a695 10-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

Fixup build after recent getsock changes


# 3be2225f 10-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

Remove fflag argument from getsock_cap

Interested callers can obtain in other own easily enough
and there is no reason to branch on it.


# 3212ad15 07-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

Add getsock

All but one consumers of getsock_cap only pass 4 arguments.
Take advantage of it.


# a2ad7092 10-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

Add branch prediction hints to getsock_cap


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 31d1b816 28-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

sysent: Get rid of bogus sys/sysent.h include.

Where appropriate hide sysent.h under proper condition.

MFC after: 2 weeks


# d60ea9a1 25-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: return EMSGSIZE if control part of message is too large

Specification doesn't list an explicit error code for the control
size specified by msg_control being too large. But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)". It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t." Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35316


# d2b3a0ed 17-Feb-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sendto: don't clear transient errors for atomic protocols

The changeset 65572cade35 uncovered the fact that top layer of sendto(2)
would clear a transient error code if some data was copied out of uio.
The clearing of the error makes sense for non-atomic protocols, since
they have sent some data. The atomic protocols send all or nothing.

The current implementation of unix/dgram uses sosend_generic(), which
would always copyout and only then it may fail to deliver a message.
The sosend_dgram(), currently used by UDP only, also has same behavior.

Reported by: pho
Reviewed by: pho, markj
Differential revision: https://reviews.freebsd.org/D34309


# 308fc7e5 24-Jan-2022 John Baldwin <jhb@FreeBSD.org>

user_getpeername: Use 'bool' for the compat argument.

This matches user_getsockname.

Reviewed by: brooks, kib
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33987


# ba4e5253 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: normalize orecvfrom and ogetsockname

Declare o<foo>_args rather than reusing the equivalent <foo>_args
structs. Avoiding the addition of a new type isn't worth the
gratutious differences.

Reviewed by: kib, imp


# 28f04718 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

uipc: rework recvfrom, getsockname, getpeername

Stop using <foo>_args structs as part of internal kernel APIs. Add
a kern_recvfrom and adjust getsockname and getpeername's equivalent
functions to take individual arguments rather than a uap pointer.

Adopt a convention from CheriBSD that a function interacting with
userspace pointers and sitting between the sys_<foo> syscall and
kern_<foo> implementation is named user_<foo>.

Reviewed by: kib, imp


# a8aa6f1f 07-Sep-2021 Mark Johnston <markj@FreeBSD.org>

socket: Avoid clearing SS_ISCONNECTING if soconnect() fails

This behaviour appears to date from the 4.4 BSD import. It has two
problems:

1. The update to so_state is not protected by the socket lock, so
concurrent updates to so_state may be lost.
2. Suppose two threads race to call connect(2) on a socket, and one
succeeds while the other fails. Then the failing thread may
incorrectly clear SS_ISCONNECTING, confusing the state machine.

Simply remove the update. It does not appear to be necessary:
pru_connect implementations which call soisconnecting() only do so after
all failure modes have been handled. For instance, tcp_connect() and
tcp6_connect() will never return an error after calling soisconnected().
However, we cannot correctly assert that SS_ISCONNECTED is not set after
an error from soconnect() since the socket lock is not held across the
pru_connect call, so a concurrent connect(2) may have set the flag.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31699


# 091869de 27-Aug-2021 Mark Johnston <markj@FreeBSD.org>

connect: Use soconnectat() unconditionally in kern_connect()

soconnect(...) is equivalent to soconnectat(AT_FDCWD, ...), so rely on
this to save a branch. No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# f4bb1869 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOLISTENING() macro

Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# f187d6df 15-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

base: remove if_wg(4) and associated utilities, manpage

After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.


# 74ae3f3e 14-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

if_wg: import latest fixup work from the wireguard-freebsd project

This is the culmination of about a week of work from three developers to
fix a number of functional and security issues. This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
the interface's home vnet so that it can act as the sole network
connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
complete. It is additionally supported by the upstream
wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib. iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations. This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after: 1 month (maybe)


# 7e097daa 11-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Only enable COMPAT_43 changes for syscalls ABI for a.out processes.

Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21200


# 401ca034 05-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Avoid leaking fp references when truncating SCM_RIGHTS control messages.

Reported by: pho
Approved by: so
MFC after: 0 minutes
Security: CVE-2019-5596
Sponsored by: The FreeBSD Foundation


# d48719bd 04-Dec-2018 Brooks Davis <brooks@FreeBSD.org>

Normalize COMPAT_43 syscall declarations.

Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15816


# 318f0d77 06-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Use declared types for caddr_t arguments.

Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852


# c7902fbe 07-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Improve handling of control message truncation.

If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages. In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table. Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient. To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
message boundary, e.g., if the caller received two control messages
and provided only the exact amount of buffer space needed for the
first.

PR: 131876
Reviewed by: ed (previous version)
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16561


# da446550 02-Aug-2018 Alan Somers <asomers@FreeBSD.org>

Fix LOCAL_PEERCRED with socketpair(2)

Enable the LOCAL_PEERCRED socket option for unix domain stream sockets
created with socketpair(2). Previously, it only worked with unix domain
stream sockets created with socket(2)/listen(2)/connect(2)/accept(2).

PR: 176419
Reported by: Nicholas Wilson <nicholas@nicholaswilson.me.uk>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16350


# 8a656309 22-May-2018 Matt Macy <mmacy@FreeBSD.org>

kern_sendit: use pre-initialized rights


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 2216c693 30-Apr-2018 Ed Maste <emaste@FreeBSD.org>

Disable connectat/bindat with AT_FDCWD in capmode

Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilties mode. This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode). Similarly for bindat.

Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.

PR: 222632
Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: Domagoj Stolfa
MFC after: 1 week
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D15221


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 34a77b97 27-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Move uio enums to sys/_uio.h.

Include _uio.h instead of uio.h in several headers to reduce header
polution.

Fix a few places that relied on header polution to get the uio.h header.

I have not moved struct uio as many more things that use it rely on
header polution to get other definitions from uio.h.

Reviewed by: cem, kib, markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14811


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 779f106a 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770


# d293f35c 29-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_listen(), kern_shutdown(), and kern_socket(), and use them
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.

Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9367


# 9d71a397 21-Oct-2016 Hiren Panchasara <hiren@FreeBSD.org>

Rework r306337.

In sendit(), if mp->msg_control is present, then in sockargs() we are
allocating mbuf to store mp->msg_control. Later in kern_sendit(), call
to getsock_cap(), will check validity of file pointer passed, if this
fails EBADF is returned but mbuf allocated in sockargs() is not freed.
Made code changes to free the same.

Since freeing control mbuf in sendit() after checking (control != NULL)
may lead to double freeing of control mbuf in sendit(), we can free
control mbuf in kern_sendit() if there are any errors in the routine.

Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D8152


# 7c9a4d09 26-Sep-2016 Hiren Panchasara <hiren@FreeBSD.org>

Revert r306337. dhw@ reproted a panic which seems related to this and bde@ has
raised some issues.


# 41bb1a25 26-Sep-2016 Hiren Panchasara <hiren@FreeBSD.org>

In sendit(), if mp->msg_control is present, then in sockargs() we are allocating
mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(),
will check validity of file pointer passed, if this fails EBADF is returned but
mbuf allocated in sockargs() is not freed. Fix this possible leak.

Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: adrian
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D7910


# 85b0f9de 22-Sep-2016 Mariusz Zaborski <oshogbo@FreeBSD.org>

capsicum: propagate rights on accept(2)

Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# 2e4fd101 10-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix build


# 82b3cec5 09-Sep-2016 Ed Maste <emaste@FreeBSD.org>

ANSIfy uipc_syscalls.c

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7839


# e9877429 18-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

The SA-16:19 wouldn't have happened if the sockargs() had properly typed
argument for length. While here make it static and convert to ANSI C.

Reviewed by: C Turt


# 7349ea78 17-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Validate that user supplied control message length is not negative.

Submitted by: C Turt <cturt hardenedbsd.org>
Security: SA-16:19
Security: CVE-2016-1887


# b85f65af 15-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

kern: for pointers replace 0 with NULL.

These are mostly cosmetical, no functional change.

Found with devel/coccinelle.


# 33a2a37b 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Separate sendfile(2) implementation from uipc_syscalls.c into
separate file. Claim my copyright.
- Provide more comments, better function and structure names.
- Sort out unneeded includes from resulting two files.

No functional changes.


# 2bab0c55 08-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
up to now.

The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.

Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.

The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.

Celebrates: Netflix global launch!
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Relnotes: yes


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# b114aa79 27-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Make shutdown() return ENOTCONN as required by POSIX, part deux.

Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.

Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.

I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same.

Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:

https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26

Reviewers: glebius, rwatson, #manpages, gnn, #network

Reviewed By: gnn, #network

Subscribers: bms, mjg, imp

Differential Revision: https://reviews.freebsd.org/D3039


# 093c7f39 12-Jun-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 25742185 11-Apr-2015 Mateusz Guzik <mjg@FreeBSD.org>

Replace struct filedesc argument in getsock_cap with struct thread

This is is a step towards removal of spurious arguments.


# 90f54cbf 11-Apr-2015 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove filedesc argument from fdclose

Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.


# 3d597295 28-Feb-2015 Ryan Stone <rstone@FreeBSD.org>

Correct the use of an unitialized variable in sendfind_getobj()

When sendfile_getobj() is called on a DTYPE_SHM file, it never
initializes error, which is eventually returned to the caller.

Differential Revision: https://reviews.freebsd.org/D1989
Reviewed by: kib
Reported by: Brainy Code Scanner, by Maxime Villard.


# b7a39e9e 17-Feb-2015 Mateusz Guzik <mjg@FreeBSD.org>

filedesc: simplify fget_unlocked & friends

Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.

Reviewed by: silence on -hackers


# 6e646651 13-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# efe28398 11-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix build.


# 0e87b36e 11-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SF_KQUEUE code. This code was developed at Netflix, but was not
ever used. It didn't go into stable/10, neither was documented.
It might be useful, but we collectively decided to remove it, rather
leave it abandoned and unmaintained. It is removed in one single
commit, so restoring it should be easy, if anyone wants to reopen
this idea.

Sponsored by: Netflix


# 80b47aef 09-Oct-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.


# 818d40d0 10-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Provide sf_buf_ref() to optimize refcounting of already allocated
sendfile(2) buffers.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 1fbe6a82 11-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Improve reference counting of EXT_SFBUF pages attached to mbufs.

o Do not use UMA refcount zone. The problem with this zone is that
several refcounting words (16 on amd64) share the same cache line,
and issueing atomic(9) updates on them creates cache line contention.
Also, allocating and freeing them is extra CPU cycles.
Instead, refcount the page directly via vm_page_wire() and the sfbuf
via sf_buf_alloc(sf_buf_page(sf)) [1].

o Call refcounting/freeing function for EXT_SFBUF via direct function
call, instead of function pointer. This removes barrier for CPU
branch predictor.

o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
copy m_ext in mb_dupcl().

[1] This has some problems for now. Using sf_buf_alloc() merely to
increase refcount is expensive, and is broken on sparc64. To be
fixed.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 15c28f87 11-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

All mbuf external free functions never fail, so let them be void.

Sponsored by: Nginx, Inc.


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 857ce8a2 11-May-2014 Jilles Tjoelker <jilles@FreeBSD.org>

accept(),accept4(): Don't set *addrlen = 0 on [ECONNABORTED].

If the underlying protocol reported an error (e.g. because a connection was
closed while waiting in the queue), this error was also indicated by
returning a zero-length address. For all other kinds of errors (e.g.
[EAGAIN], [ENFILE], [EMFILE]), *addrlen is unmodified and there are
successful cases where a zero-length address is returned (e.g. a connection
from an unbound Unix-domain socket), so this error indication is not
reliable.

As reported in Austin Group bug #836, modifying *addrlen on error may cause
subtle bugs if applications retry the call without resetting *addrlen.


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 0cfea1c8 16-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Implement a kqueue notification path for sendfile.

This fires off a kqueue note (of type sendfile) to the configured kqfd
when the sendfile transaction has completed and the relevant memory
backing the transaction is no longer in use by this transaction.
This is analogous to SF_SYNC waiting for the mbufs to complete -
except now you don't have to wait.

Both SF_SYNC and SF_KQUEUE should work together, even if it
doesn't necessarily make any practical sense.

This is designed for use by applications which use backing cache/store
files (eg Varnish) or POSIX shared memory (not sure anything is using
it yet!) to know when a region of memory is free for re-use. Note
it doesn't mark the region as free overall - only free from this
transaction. The application developer still needs to track which
ranges are in the process of being recycled and wait until all
pending transactions are completed.

TODO:

* documentation, as always

Sponsored by: Netflix, Inc.


# a43caef1 08-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Refactor out the common sendfile code from the do_sendfile() and the
compat32 sendfile syscall.

Sponsored by: Netflix, Inc.


# dc3bdd4a 16-Dec-2013 Adrian Chadd <adrian@FreeBSD.org>

Remove the invariants stuff I copy/paste'd from the mbuf code when
setting up the UMA zone.

This should (a) be correct(er) and (b) it should build on non-amd64.

Pointed out by: glebius


# 73242a5e 16-Dec-2013 Adrian Chadd <adrian@FreeBSD.org>

Migrate the sendfile_sync struct to use a UMA zone rather than M_TEMP.

This allows it to be better tracked as well as being able to leverage
UMA for more interesting/useful behaviour at a later date.

Sponsored by: Netflix, Inc.


# ad4804a0 01-Dec-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove unused variable.


# 79750e3b 30-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by: Netflix, Inc.


# 3287361e 25-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Refactor out the sendfile copyout in order to make vn_sendfile()
callable from the kernel.

Right now vn_sendfile() can't be called from anything other than
a syscall handler _and_ return the number of bytes queued.
This simply moves the copyout() to do_sendfile() so that any kernel
code can initiate vn_sendfile() outside of a syscall context.

Tested:

* tiny little sendfile program spitting things out a tcp socket

Sponsored by: Netflix, Inc.


# 44f3b9c7 21-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

Print more useful information about the transfer that trigger the assertion.
Other data is available with ddb command 'show pginfo'.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 255c1caa 22-Sep-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Create kern.ipc.sendfile namespace, and put the new "readhead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
to new namespace in favor of POLA.

Reviewed by: scottl
Approved by: re (gjb)


# 85fdd534 17-Sep-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix assertion in sendfile_readpage() to assert only the validity
of requested amount of data in a page. Move assertion down below
object unlock.

Approved by: re (kib)
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 64c5de54 11-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix build with gcc.

Build-tested by: gjb
Approved by: re (glebius)


# 227aaa86 11-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement sendfile(2) for the posix shared memory segment file descriptor,
in addition to the regular files.

Requested by: alc
Discussed with: emaste
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


# 1a05c762 10-Sep-2013 Dag-Erling Smørgrav <des@FreeBSD.org>

Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory. [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks. [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem. [SA-13:13]

Security: CVE-2013-5666
Security: FreeBSD-SA-13:11.sendfile
Security: CVE-2013-5691
Security: FreeBSD-SA-13:12.ifioctl
Security: CVE-2013-5710
Security: FreeBSD-SA-13:13.nullfs
Approved by: re


# 547561f1 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Style fixes. Most fixes are about not treating integers and pointers as
booleans.


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# bb25e5ab 25-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Give (*ext_free) an int return value allowing for very sophisticated
external mbuf buffer management capabilities in the future.

For now only EXT_FREE_OK is defined with current legacy behavior.

Sponsored by: The FreeBSD Foundation


# 9a736876 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Add an mbuf pointer parameter to (*ext_free) to give the external
free function access to the mbuf the external memory was attached
to.

Mechanically adjust all users to include the mbuf parameter.

This fixes a long standing annoyance for external free functions.
Before one had to sacrifice one of the argument pointers for this.

Sponsored by: The FreeBSD Foundation


# f6d76b0e 23-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Since the 253927, which removed the soft busy call for the sf page, it
does not make sense to wait for the soft busy state of the page to
drain. The vm object lock is dropped immediately after, so the result
of the wait is invalidated.

It might make sense to not wait for the hard busy state as well,
esp. for the fully valid page, but this is postponed for now.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 42d875a5 16-Aug-2013 Xin LI <delphij@FreeBSD.org>

Fix build.


# ca04d21d 15-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 90c35c19 13-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Minor style(9) fix.
- Bring a comment up to date.


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 878a7887 04-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

Remove unnecessary soft busy of the page before to do vn_rdwr() in
kern_sendfile() which is unnecessary.
The page is already wired so it will not be subjected to pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc, jeff
Tested by: pho


# fcd9ff2c 31-Jul-2013 Scott Long <scottl@FreeBSD.org>

Another fix for r253823; retain the default of 1 readahead block for sendfile.

Submitted by: glebius
Obtained from: Netflix
MFC after: 3 days


# fc4a5f05 30-Jul-2013 Scott Long <scottl@FreeBSD.org>

Create a knob, kern.ipc.sfreadahead, that allows one to tune the amount of
readahead that sendfile() will do. Default remains the same.

Obtained from: Netflix
MFC after: 3 days


# 05d1f5bc 15-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with: glebius


# 3328431d 09-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Item 1 in r248830 causes earlier exits from the sendfile(2), before
all requested data was sent. The reason is that xfsize <= 0 condition
must not be tested at all if space == loopbytes. Otherwise, the done
is set to 1, and sendfile(2) is aborted too early.

Instead of moving the condition to exiting the inner loop after the
xfersize check, directly check for the completed transfer before the
testing of the available space in the socket buffer, and revert item 1
of r248830. It is arguably another bug to sleep waiting for socket
buffer space (or return EAGAIN for non-blocking socket) if all bytes
are already transferred.

Reported by: pho
Discussed with: scottl, gibbs
Tested by: scottl (stable/9 backport), pho


# da7d2afb 01-May-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Add accept4() system call.

The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.


# 3d317679 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Eliminate the layering violation in the kern_sendfile(). When quering
the file size, use VOP_GETATTR() instead of accessing vnode vm_object
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as consequence, the whole internal sendfile loop. Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by: alc
Tested by: pho, bf (previous version)
MFC after: 1 week


# 14658a80 19-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Don't compare unsigned socklen_t against < 0.

Reviewed by: jhb


# 07dbf2c7 28-Mar-2013 Scott Long <scottl@FreeBSD.org>

Several fixes and improvements to sendfile()

1. If we wanted to send exactly as many bytes as the socket buffer is
sized for, the inner loop of kern_sendfile() would see that the
socket is full before seeing that it had no more bytes left to send.
This would cause it to return EAGAIN to the caller instead of
success. Fix by changing the order that these conditions are tested.
2. Simplify the calculation for the bytes to send in each iteration of
the inner loop of kern_sendfile()
3. Fix some calls with bogus arguments to sf_buf_ext(). These would
only trigger on mbuf allocation failure, but would be hilariously
bad if they did trigger.

Submitted by: gibbs(3), andre(2)
Reviewed by: emax, andre
Obtained from: Netflix
MFC after: 1 week


# c2e3c52e 19-Mar-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.

This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by: kib


# 93cfe763 15-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by: Nginx, Inc.


# 3b4a84e7 11-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.

Sponsored by: Nginx, Inc.


# fbb34710 11-Mar-2013 Michael Tuexen <tuexen@FreeBSD.org>

Return an error if sctp_peeloff() fails because a socket can't be allocated.

MFC after: 3 days


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 7493f24e 02-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# fbda3d5d 06-Feb-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and
recvfrom(2) syscalls.

Sponsored by: The FreeBSD Foundation


# 82b316b3 06-Feb-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Minor style tweaks.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 1c771f92 05-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


# f3cd9805 11-Jun-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Style fixes and simplifications.

MFC after: 1 month


# 3b5da8d6 08-Jun-2012 Mateusz Guzik <mjg@FreeBSD.org>

Plug socket refcount leak on error in sys_sctp_peeloff.

Reviewed by: tuexen
Approved by: trasz (mentor)
MFC after: 3 days


# 36eeafa0 04-Jun-2012 Gleb Smirnoff <glebius@FreeBSD.org>

style(9) for r236563.


# 8955d272 04-Jun-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Microoptimisation of code from r236560, also coming from Nginx Inc.

Submitted by: ru


# 835d8900 03-Jun-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Optimise kern_sendfile(): skip cycling through the entire mbuf chain in
m_cat(), storing pointer to last mbuf in chain in local variable and
attaching new mbuf to the end of chain.

Submitter reports that CPU load dropped for > 10% on a web server
serving large files with this optimisation.

Submitted by: Sergey Budnevitch <sb nginx.com>


# 99f293a2 15-Mar-2012 Michael Tuexen <tuexen@FreeBSD.org>

Fix bugs which can result in a panic when an non-SCTP socket it
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 12bc222e 30-Jun-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)


# c721b934 07-Jun-2011 John Baldwin <jhb@FreeBSD.org>

Log the socket address passed as the destination to sendto() and sendmsg()
via ktrace.

MFC after: 1 week


# 1fe80828 01-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by: jhb
X-MFC-note: do not


# 1fb51a12 16-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


# 8189ac85 03-Feb-2011 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary page hold_count checks. These checks predate
r90944, which introduced a general mechanism for handling the freeing
of held pages.

Reviewed by: kib@


# 9ca9fc53 28-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

If more than one thread allocated sf buffers for sendfile(2), and
each of the threads needs more while current pool of the buffers is
exhausted, then neither thread can make progress.

Switch to nowait allocations after we got first buffer already.

Reported by: az
Reviewed by: alc (previous version)
Tested by: pho
MFC after: 1 week


# b452cf63 13-Dec-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Just pass M_ZERO to malloc(9) instead of clearing allocated memory separately.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 049640c1 05-Sep-2010 Michael Tuexen <tuexen@FreeBSD.org>

Implement correct handling of address parameter and
sendinfo for SCTP send calls.

MFC after: 4 weeks.


# 2f9f22ae 05-Jul-2010 Michael Tuexen <tuexen@FreeBSD.org>

MFC r209624
* Do not dereference a NULL pointer when calling an SCTP send syscall
not providing a destination address and using ktrace.
* Do not copy out kernel memory when providing sinfo for sctp_recvmsg().
Both bugs where reported by Valentin Nechayev.
The first bug results in a kernel panic.
Approved by: re@


# 7a6f3d78 29-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Send SIGPIPE to the thread that issued the offending system call
rather than to the entire process.

Reported by: Anit Chakraborty
Reviewed by: kib, deischen (concept)
MFC after: 1 week


# e1c97831 26-Jun-2010 Michael Tuexen <tuexen@FreeBSD.org>

* Do not dereference a NULL pointer when calling an SCTP send syscall
not providing a destination address and using ktrace.
* Do not copy out kernel memory when providing sinfo for sctp_recvmsg().
Both bug where reported by Valentin Nechayev.
The first bug results in a kernel panic.
MFC after: 3 days.


# 60ae52f7 21-Jun-2010 Ed Schouten <ed@FreeBSD.org>

Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.


# f0c0d399 06-May-2010 Alan Cox <alc@FreeBSD.org>

Remove page queues locking from all sf_buf_mext()-like functions. The page
lock now suffices.

Fix a couple nearby style violations.


# 52683078 06-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate a small bit of unneeded code from kern_sendfile(): While
kern_sendfile() is running, the file's vm object can't be destroyed
because kern_sendfile() increments the vm object's reference count. (Once
kern_sendfile() decrements the reference count and returns, the vm object
can, however, be destroyed. So, sf_buf_mext() must handle the case where
the vm object is destroyed.)

Reviewed by: kib


# 91381493 02-May-2010 Alan Cox <alc@FreeBSD.org>

This is the first step in transitioning responsibility for synchronizing
access to the page's wire_count from the page queues lock to the page lock.

Submitted by: kmacy


# a0b8e597 02-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Lock the page around hold_count access.

Reviewed by: alc


# 9fb7bf55 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205318:
Properly handle compat32 calls to sctp generic sendmsd/recvmsg functions that
take iov.


# 003465f5 02-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205317:
Remove dead statement.


# 131e8de2 02-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205316:
Fix two style issues.


# 5322f02e 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Properly handle compat32 calls to sctp generic sendmsd/recvmsg functions that
take iov.

Reviewed by: tuexen
MFC after: 2 weeks


# fd9d1e76 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove dead statement.

Reviewed by: tuexen
MFC after: 2 weeks


# 0a977ede 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix two style issues.

MFC after: 2 weeks


# 0454fe84 18-Feb-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Use NULL instead of 0 when setting up pointer.


# fc946fd0 27-Dec-2009 Matt Jacob <mjacob@FreeBSD.org>

MFC 200620,200621: fix argument order to mtx_init call.


# e7d829a4 16-Dec-2009 Matt Jacob <mjacob@FreeBSD.org>

Fix argument order in a call to mtx_init.

MFC after: 1 week


# cf19fced 07-Dec-2009 Michael Tuexen <tuexen@FreeBSD.org>

MFC 197288,197326,197327,197328,197342,197914,197929,
197955,199365,199370,199371,199373,199866
This MFCs all SCTP/VNET relevant fixes from head.

Approved by: rrs (mentor)


# c3568f6f 17-Nov-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198853:
If socket buffer space appears to be lower then sum of count of
already prepared bytes and next portion of transfer, inner loop of
kern_sendfile() aborts, not preparing next mbuf for socket buffer, and
not modifying any outer loop invariants. The thread loops in the outer
loop forever.

Instead of breaking from inner loop, prepare only bytes that fit into
the socket buffer space.


# 1c89fc75 02-Nov-2009 Konstantin Belousov <kib@FreeBSD.org>

If socket buffer space appears to be lower then sum of count of already
prepared bytes and next portion of transfer, inner loop of kern_sendfile()
aborts, not preparing next mbuf for socket buffer, and not modifying
any outer loop invariants. The thread loops in the outer loop forever.

Instead of breaking from inner loop, prepare only bytes that fit into
the socket buffer space.

In collaboration with: pho
Reviewed by: bz
PR: kern/138999
MFC after: 2 weeks


# 7415a41f 29-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix style issue.


# 88c45ef7 08-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197662:
Do not dereference vp->v_mount without holding vnode lock and checking
that the vnode is not reclaimed.

Approved by: re (bz)


# 75ffdc40 30-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not dereference vp->v_mount without holding vnode lock and checking
that the vnode is not reclaimed.

Noted by: Igor Sysoev <is rambler-co ru>
MFC after: 1 week


# 8518270e 19-Sep-2009 Michael Tuexen <tuexen@FreeBSD.org>

Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 15ca46f6 01-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Audit file descriptor numbers for various socket-related system calls.

Approved by: re (audit argument blanket)
MFC after: 3 days


# 9e4c1521 01-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Define missing audit argument macro AUDIT_ARG_SOCKET(), and
capture the domain, type, and protocol arguments to socket(2)
and socketpair(2).

Approved by: re (audit argument blanket)
MFC after: 3 days


# c03528b6 10-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

SCTP needs either IPv4 or IPv6 as lower layer[1].
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.

Discussed with: tuexen [1]


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# f93bfb23 02-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from: TrustedBSD Project


# 4202e1be 30-May-2009 Dmitry Chagin <dchagin@FreeBSD.org>

Split native socketpair() syscall onto kern_socketpair() which should
be used by kernel consumers and socketpair() itself.

Approved by: kib (mentor)
MFC after: 1 month


# bf422e5f 13-May-2009 Jeff Roberson <jeff@FreeBSD.org>

- Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
- Save old file descriptor tables created on expansion until
the entire descriptor table is freed so that pointers may be
followed without regard for expanders.
- Mark the file zone as NOFREE so we may attempt to reference
potentially freed files.
- Convert several fget_locked() users to fget_unlocked(). This
requires us to manage reference counts explicitly but reduces
locking overhead in the common case.


# 2114e063 08-May-2009 Marko Zec <zec@FreeBSD.org>

A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by: bz
Approved by: julian (mentor) (an earlier version of the diff)


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# f0b9868d 11-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

sendfile doesn't modify the vnode - acquire vnode lock shared

Reviewed by: ups, jeffr


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 17c2fc0c 22-May-2008 Robert Watson <rwatson@FreeBSD.org>

When sendto(2) is called with an explicit destination address
argument, call mac_socket_check_connect() on that address before
proceeding with the send. Otherwise policies instrumenting the
connect entry point for the purposes of checking destination
addresses will not have the opportunity to check implicit
connect requests.

MFC after: 3 weeks
Sponsored by: nCircle Network Security, Inc.


# ae11a989 27-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

When writing trailers in sendfile(2), don't call kern_writev()
while holding the socket buffer lock. These leads to an
immediate panic due to recursing the socket buffer lock. This
bug was introduced in uipc_syscalls.c:1.240, but masked by
another bug until that was fixed in uipc_syscalls.c:1.269.

Note that the current fix isn't perfect, but better than
panicking: normally we guarantee that simultaneous invocations
of a system call to write on a stream socket won't be
interlaced, which is ensured by use of the socket buffer sleep
lock. This is guaranteed for the sendfile headers, but not
trailers. In practice, this is likely not a problem, but
should be fixed.

MFC after: 3 days
Pointy hat to: andre (1.240), cperciva (1.269)


# ea26d587 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


# 49186916 23-Feb-2008 Colin Percival <cperciva@FreeBSD.org>

After finishing sending file data in sendfile(2), don't forget to send
the provided trailers. This has been broken since revision 1.240.

Submitted by: Dan Nelson
PR: kern/120948
"sounds ok to me" from: phk
MFC after: 3 days


# 60e15db9 22-Feb-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.

In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding funtions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.

Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.

Derived from patches by Andrew Li <andrew2.li@citi.com>.

PR: kern/117836
MFC after: 3 weeks


# 1b708999 14-Feb-2008 Simon L. B. Nielsen <simon@FreeBSD.org>

Fix sendfile(2) write-only file permission bypass.

Security: FreeBSD-SA-08:03.sendfile
Submitted by: kib


# b75a1171 03-Feb-2008 Poul-Henning Kamp <phk@FreeBSD.org>

Give sendfile(2) a SF_SYNC flag which makes it wait until all mbufs
referencing the files VM pages are returned from the network stack,
making changes to the file safe.

This flag does not guarantee that the data has been transmitted to the
other end.


# cf827063 01-Feb-2008 Poul-Henning Kamp <phk@FreeBSD.org>

Give MEXTADD() another argument to make both void pointers to the
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.

Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.

Likely break the only non-convetional user of the API, after informing
the relevant committer.

Update the mbuf(9) manual page, which was already out of sync on
this point.

Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.

This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.

This does not affect the memory layout or size of mbufs.

Parental oversight by: sam and rwatson.

No MFC is anticipated.


# 265de5bb 31-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 397c19d1 29-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 2afb3e84 26-Aug-2007 Randall Stewart <rrs@FreeBSD.org>

- During shutdown pending, when the last sack came in and
the last message on the send stream was "null" but still
there, a state we allow, we could get hung and not clean
it up and wait for the shutdown guard timer to clear the
association without a graceful close. Fix this so that
that we properly clean up.
- Added support for Multiple ASCONF per new RFC. We only
(so far) accept input of these and cannot yet generate
a multi-asconf.
- Sysctl'd support for experimental Fast Handover feature. Always
disabled unless sysctl or socket option changes to enable.
- Error case in add-ip where the peer supports AUTH and ADD-IP
but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to
ABORT in this case.
- According to the Kyoto summit of socket api developers
(Solaris, Linux, BSD). We need to have:
o non-eeor mode messages be atomic - Fixed
o Allow implicit setup of an assoc in 1-2-1 model if
using the sctp_**() send calls - Fixed
o Get rid of HAVE_XXX declarations - Done
o add a sctp_pr_policy in hole in sndrcvinfo structure - Done
o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch!
- Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize
when we close sending out the data and disabling Nagle.
- Change key concatenation order to match the auth RFC
- When sending OOTB shutdown_complete always do csum.
- Don't send PKT-DROP to a PKT-DROP
- For abort chunks just always checksums same for
shutdown-complete.
- inpcb_free front state had a bug where in queue
data could wedge an assoc. We need to just abandon
ones in front states (free_assoc).
- If a peer sends us a 64k abort, we would try to
assemble a response packet which may be larger than
64k. This then would be dropped by IP. Instead make
a "minimum" size for us 64k-2k (we want at least
2k for our initack). If we receive such an init
discard it early without all the processing.
- When we peel off we must increment the tcb ref count
to keep it from being freed from underneath us.
- handling fwd-tsn had bugs that caused memory overwrites
when given faulty data, fixed so can't happen and we
also stop at the first bad stream no.
- Fixed so comm-up generates the adaption indication.
- peeloff did not get the hmac params copied.
- fix it so we lock the addr list when doing src-addr selection
(in future we need to use a multi-reader/one writer lock here)
- During lowlevel output, we could end up with a _l_addr set
to null if the iterator is calling the output routine. This
means we would possibly crash when we gather the MTU info.
Fix so we only do the gather where we have a src address
cached.
- we need to be sure to set abort flag on conn state when
we receive an abort.
- peeloff could leak a socket. Moved code so the close will
find the socket if the peeloff fails (uipc_syscalls.c)

Approved by: re@freebsd.org(Ken Smith)


# 0bf686c1 06-Aug-2007 Robert Watson <rwatson@FreeBSD.org>

Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)


# b8709d23 01-Jul-2007 Randall Stewart <rrs@FreeBSD.org>

- Add some needed error checking on bad fd passing in the sctp
syscalls.
Approved by: re@freebsd.org (Ken Smith)
Obtained from: Weongyo Jeong (weongyo.jeong@gmail.com)


# 1c9dbd15 19-May-2007 Andre Oppermann <andre@FreeBSD.org>

In kern_sendfile() adjust byte accounting of the file sending loop to
ignore the size of any headers that were passed with the sendfile(2)
system call. Otherwise the file sent will be truncated by the header
size if the nbytes parameter was provided. The bug doesn't show up
when either nbytes is zero, meaning send the whole file, or no header
iovec is provided.

Resolve a potential error aliasing of errors from the VM and sf_buf
parts and the protocol send parts where an error of the latter over-
writes one of the former.

Update comments.

The byte accounting bug wasn't seen in earlier because none of the popular
sendfile(2) consumers, Apache, lighttpd and our ftpd(8) use it in modes
that trigger it. The varnish HTTP proxy makes full use of it and exposed
the problem.

Bug found by: phk
Tested by: phk


# d19e16a7 16-May-2007 Robert Watson <rwatson@FreeBSD.org>

Generally migrate to ANSI function headers, and remove 'register' use.


# 7abab911 03-May-2007 Robert Watson <rwatson@FreeBSD.org>

sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.


# eed20b37 20-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Don't reinvent vm_page_grab().

Reviewed by: ups


# fb1daf81 18-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix a bug in sendfile(2) when files larger than page size and nbytes=0.
When nbytes=0, sendfile(2) should use file size. Because of the bug, it
was sending half of a file. The bug is that 'off' variable can't be used
for size calculation, because it changes inside the loop, so we should
use uap->offset instead.


# 7b20aa9c 06-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Remove XXX comment that changes to file fields should be protected with
the file lock rather than the filedesc lock: I fixed this in the last
revision.

Spotted by: kris


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 1ce2bc91 02-Apr-2007 John Baldwin <jhb@FreeBSD.org>

Fix a fd leak in socketpair():
- Close the new file objects created during socketpair() if the copyout of
the new file descriptors fails.
- Add a test to the socketpair regression test for this edge case.


# 873fbcd7 05-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 6dbde030 23-Jan-2007 Randall Stewart <rrs@FreeBSD.org>

Fixes the MSG_PEEK for sctp_generic_recvmsg() the msg_flags
were not being copied in properly so PEEK and any other
msg_flags input operation were not being performed right.
Approved by: gnn


# 3e932ca7 12-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

In kern_sendfile() fix the calculation of sbytes (the total number of bytes
written to the socket). The rewrite in revision 1.240 got confused by the
FreeBSD 4.x bug compatibility code.

For some reason lighttpd, that was used for testing the new sendfile code,
was not affected by the problem but apache and others using headers/trailers
in the sendfile call received incorrect sbytes values after return from non-
blocking sockets. This then lead to restarts with wrong offsets and thus
mixed up file contents when the socket was writeable again. All programs
not using headers/trailers, like ftpd, were not affected by the bug.

Reported by: Pawel Worach <pawel.worach-at-gmail.com>
Tested by: Pawel Worach <pawel.worach-at-gmail.com>


# 62b36a7f 07-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

Style cleanups to the sctp_* syscall functions.


# bda8b1f3 06-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

Handle early errors in kern_sendfile() by introducing a new goto 'out'
label after the sbunlock() part.

This correctly handles calls to sendfile(2) without valid parameters
that was broken in rev. 1.240.

Coverity error: 272162


# f8829a4a 03-Nov-2006 Randall Stewart <rrs@FreeBSD.org>

Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by: gnn
Approved by: gnn


# 5e20f43d 02-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

Rename m_getm() to m_getm2() and rewrite it to allocate up to page sized
mbuf clusters. Add a flags parameter to accept M_PKTHDR and M_EOR mbuf
chain flags. Provide compatibility macro for m_getm() calling m_getm2()
with M_PKTHDR set.

Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the
uiomove() in a tight loop over the mbuf chain. Add a flags parameter to
accept mbuf flags to be passed to m_getm2(). Adjust all callers for the
extra parameter.

Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 month


# d99b0dd2 02-Nov-2006 Andre Oppermann <andre@FreeBSD.org>

Rewrite kern_sendfile() to work in two loops, the inner which turns as many
VM pages into mbufs as it can -- up to the free send socket buffer space.
The outer loop then drops the whole mbuf chain into the send socket buffer,
calls tcp_output() on it and then waits until 50% of the socket buffer are
free again to repeat the cycle. This way tcp_output() gets the full amount
of data to work with and can issue up to 64K sends for TSO to chop up in
the network adapter without using any CPU cycles. Thus it gets very efficient
especially with the readahead the VM and I/O system do.

The previous sendfile(2) code simply looped over the file, turned each 4K
page into an mbuf and sent it off. This had the effect that TSO could only
generate 2 packets per send instead of up to 44 at its maximum of 64K.

Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of
sleeping on mbuf allocation failures.

Benchmarking shows significant improvements (95% confidence):
45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO)
83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO)

(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)

Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 month


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 7c4b7ecc 05-Aug-2006 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock in kern_sendfile() now that
vm_page_sleep_if_busy() no longer requires the caller to hold the page
queues lock.


# 10c09f3f 03-Aug-2006 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_io_start(). Reduce
the scope of the page queues lock in kern_sendfile() accordingly.


# f30e89ce 27-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Fix a file descriptor race I reintroduced when I split accept1() up into
kern_accept() and accept1(). If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object. To
fix, add a struct file ** argument to kern_accept(). If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references. It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race). While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.


# b0668f71 24-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman


# b33887ea 19-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Don't free the sockaddr in kern_bind() and kern_connect() as not all
callers pass a sockaddr allocated via malloc() from M_SONAME anymore.
Instead, free it in the callers when necessary.


# c870740e 10-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.


# fb11be62 19-Jun-2006 George V. Neville-Neil <gnn@FreeBSD.org>

Properly cast the values of valsize (the size of the value passed in)
in setsockopt so that they can be compared correctly against negative
values. Passing in a negative value had a rather negative effect
on our socket code, making it impossible to open new sockets.

PR: 98858
Submitted by: James.Juran@baesystems.com
MFC after: 1 week


# b37ffd31 10-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month


# 20bdac8a 25-May-2006 Robert Watson <rwatson@FreeBSD.org>

Use getsock() and fput() instead of fgetsock() and fputsock() in
sendfile(). This causes sendfile() to use the file descriptor
reference to the socket instead of bumping the socket reference
count, which avoids an additional refcount operation, as well as a
potential expensive socket refcount drop, which can lead to
contention on the accept mutex. This change also has the side
effect of further reducing the number of cases where an in-progress
I/O operation can occur on a socket after close, as using the file
descriptor refcount prevents the socket from closing while in use.

MFC after: 3 months


# 102ea033 25-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Extend getsock() to return the struct file flags read while holding the
file lock, in the style of fgetsock().

Modify accept1() to use getsock() instead of fgetsock(), relying on the
file descriptor reference rather than an acquired socket reference to
prevent the listen socket from being destroyed during accept(). This
avoids additional reference count operations, which should improve
performance, and also avoids accept1() operating on a socket whose file
descriptor has been torn down, which may have resulted in protocol
shutdown starting.

MFC after: 3 months


# fa4c5373 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Add comment to accept1() that it should use getsock() instead of fgetsock()
to avoid additional mutex operations, and also to avoid use of soref/sorele
which are now not preferred.

MFC after: 3 months


# 7c8dcf2d 26-Mar-2006 Alan Cox <alc@FreeBSD.org>

Use NET_LOCK_GIANT() and VFS_LOCK_GIANT() instead of unconditionally
acquiring Giant in kern_sendfile().

Guard against the forced reclamation of a vnode in kern_sendfile().

Discussed with: jeff
Reviewed by: tegge
MFC after: 3 weeks


# fa545f43 28-Feb-2006 Paul Saab <ps@FreeBSD.org>

Fix 32bit sendfile by implementing kern_sendfile so that it takes
the header and trailers as iovec arguments instead of copying them
in inside of sendfile.

Reviewed by: jhb
MFC after: 3 weeks


# ecc44de7 31-Oct-2005 Paul Saab <ps@FreeBSD.org>

Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by: ps, ups
Reviewed by: jhb


# a372f822 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement the 32bit versions of recvmsg, recvfrom, sendmsg

Partially obtained from: jhb


# 6758f88e 05-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Add MAC Framework and MAC policy entry point mac_check_socket_create(),
which is invoked from socket() and socketpair(), permitting MAC
policy modules to control the creation of sockets by domain, type, and
protocol.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)
Requested by: SCC


# 75ae2570 04-May-2005 Maksim Yevmenkin <emax@FreeBSD.org>

Change m_uiotombuf so it will accept offset at which data should be copied
to the mbuf. Offset cannot exceed MHLEN bytes. This is currently used to
fix Ethernet header alignment problem on alpha and sparc64. Also change all
users of m_uiotombuf to pass proper offset.

Reviewed by: jmg, sam
Tested by: Sten Spans "sten AT blinkenlights DOT nl"
MFC after: 1 week


# 7f53207b 16-Apr-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce three additional MAC Framework and MAC Policy entry points to
control socket poll() (select()), fstat(), and accept() operations,
required for some policies:

poll() mac_check_socket_poll()
fstat() mac_check_socket_stat()
accept() mac_check_socket_accept()

Update mac_stub and mac_test policies to be aware of these entry points.
While here, add missing entry point implementations for:

mac_stub.c stub_check_socket_receive()
mac_stub.c stub_check_socket_send()
mac_test.c mac_test_check_socket_send()
mac_test.c mac_test_check_socket_visible()

Obtained from: TrustedBSD Project
Sponsored by: SPAWAR, SPARTA


# f247a524 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# 8d6e40c3 08-Mar-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Add kernel-only flag MSG_NOSIGNAL to be used in emulation layers to surpress
SIGPIPE signal for the duration of the sento-family syscalls. Use it to
replace previously added hack in Linux layer based on temporarily setting
SO_NOSIGPIPE flag.

Suggested by: alfred


# 29bdd019 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

Remove now unused 'int s' from spl().

MFC after: 3 days


# d8d716be 18-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

De-spl kern_connect().

MFC after: 3 days


# 1e8f8954 17-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

In accept1(), extend coverage of the socket lock from just covering
soref() to also covering the update of so_state. While no other user
threads can update the socket state here as it's not yet hooked up to
the file descriptor array yet, the protocol could also frob the
socket state here, leading to a lost update to the so_state field.
No reported instances of this bug (as yet).

MFC after: 3 days


# a6886ef1 30-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Extend kern_sendit() to take another enum uio_seg argument, which specifies
where the buffer to send lies and use it to eliminate yet another stackgap
in linuxlator.

MFC after: 2 weeks


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# f6dc414a 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Save a line by unlocking before we test.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 124e4c3b 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# d3cb0d99 07-Nov-2004 Alan Cox <alc@FreeBSD.org>

Introduce two new options, "CPU private" and "no wait", to sf_buf_alloc().
Change the spelling of the "catch" option to be consistent with the new
options. Implement the "no wait" option. An implementation of the "CPU
private" for i386 will be committed at a later date.


# ef11fbd7 07-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce fdclose() which will clean an entry in a filedesc.

Replace homerolled versions with call to fdclose().

Make fdunused() static to kern_descrip.c


# b1fa7527 07-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Use fget_locked() instead of homerolled


# d19ef814 03-Nov-2004 Alan Cox <alc@FreeBSD.org>

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


# ae1e5c5d 24-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Move from using the socket reference count to the file reference
count to prevent sockets from being garbage collected during
socket-specific system calls. This is the same approach used in
most VFS-specific system calls, as well as generic file descriptor
system calls such as read() and write().

To do this, add a utility function getsock(), which is logically
identical to getvnode() used for the same purpose in VFS. Unlike
fgetsock(), it returns with the file reference count elevated, but
no bump of the socket reference count. Replace matching calls to
fputsock() with fdrop().

This change is made to all socket system calls other than
sendfile() and accept(), but the approach should be applicable to
those system calls also.

This shaves about four mutex operations off of each of these
system calls, including send() and recv() variants, adding about
1% to pps on minimal UDP packets for UP using netblast, and 4% on
SMP.

Reviewed by: pjd


# 01ad40da 24-Oct-2004 Alan Cox <alc@FreeBSD.org>

Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().


# 0f777d7d 20-Oct-2004 Alan Cox <alc@FreeBSD.org>

Modify the vm object locking in do_sendfile() so that the containing object
is locked when vm_page_io_finish() is called on a page. This is to satisfy
a new, post-RELENG_5 assertion in vm_page_io_finish(). (I am in the
process of transitioning the responsibility for synchronizing access to
various fields/flags on the page from the global page queues lock to the
per-object lock.)

Tripped over by: obrien@


# 86dac448 01-Oct-2004 Alan Cox <alc@FreeBSD.org>

Add a SOCKBUF_LOCK() to a rarely executed path in do_sendfile().


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# e140eb43 17-Jul-2004 David Malone <dwmalone@FreeBSD.org>

Add a kern_setsockopt and kern_getsockopt which can read the option
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.


# 552afd9c 10-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Clean up and wash struct iovec and struct uio handling.

Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec. Caller frees.

Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.

Add cloneuio() which returns a malloc'ed copy. Caller frees.

Use them throughout.


# 6ec70e64 08-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Remove spl()'s from do_sendfile().


# ad6b0eff 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Acquire socket lock in the "waiting for connection" loop in
kern_connect(), replacing tsleep() with msleep() with the socket
mutex.


# a3146ff9 22-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

Fix an inconsistency in socket option propagation on accept(). Propagate
the SS_NBIO flag from the parent socket to the child socket during an
accept() operation.

The file descriptor O_NONBLOCK flag would have been propagated already
by the fflag assignment, and therefore would have been inconsistent
with the underlying socket's so_state member.

This makes accept() more closely adhere to the API contract we effectively
outline in the manual page. Note also that Linux continues to differ here;
O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as
does Solaris. The Single UNIX Specification does not offer specific
advice on this issue.

PR: kern/45733
Requested by: Jayanth Vijayaraghavan
Reviewed by: rwatson


# 31f555a1 18-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state. Convert tsleep() into msleep() with socket buffer mutex
as argument. Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state. Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation. Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy. Assert the socket buffer mutex
strategically after code sections or at the beginning of loops. In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive(). This commit only merges basic socket locking in this
code; follow-up commits will close additional races. As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by: juli, tjr


# c0b99ffa 14-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 310e7ceb 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


# 3e87b34a 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Correct whitespace errors in merge from rwatson_netperf: tabs instead of
spaces, no trailing tab at the end of line.

Pointed out by: csjp


# 395a08c9 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 1930e303 11-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.


# aa57bb04 07-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Correct a resource leak introduced in recent accept locking changes:
when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.

This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).

This change updates the use of goto targets to do additional unwinding.

Eyes provided by: Brian Feldman <green@freebsd.org>
Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>,
Dimitry Andric <dimitry@andric.com>
Arjan van Leeuwen <avleeuwen@piwebs.com>


# 7a1a900c 07-Jun-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

allow more than MLEN bytes for ancillary data to meet the
requirement of Section 20.1 of RFC3542.

Obtained from: KAME
MFC after: 1 week


# 2658b3bb 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.


# 36568179 31-May-2004 Robert Watson <rwatson@FreeBSD.org>

The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# f7250466 07-May-2004 Robert Watson <rwatson@FreeBSD.org>

Unconditionally lock Giant in do_sendfile(), rather than locking it
conditional on debug.mpsafenet. We can try pushing down Giant here
later, but we don't want to enter VFS without holding Giant.

Bumped into by: kris


# 5a324893 05-May-2004 Alan Cox <alc@FreeBSD.org>

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


# e8410540 08-Apr-2004 Mike Silbersack <silby@FreeBSD.org>

Fix a regression in my change which sends headers along with data; a
side effect of that change caused headers to not be sent if a 0 byte
file was passed to sendfile. This change fixes that behavior, allowing
sendfile to send out the headers even with a 0 byte file again.

Noticed by: Dirk Engling


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 051bbf60 04-Apr-2004 Robert Watson <rwatson@FreeBSD.org>

Detatch incorrect spellings of detach.


# 121230a4 03-Apr-2004 Alan Cox <alc@FreeBSD.org>

In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it
should not. Add a new parameter so that the caller can specify which is
the case.

Reported by: dillon


# 627e4a99 28-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Conditionally acquire Giant when entering the sockets layer via the
socket-specific system calls based on debug.mpsafenet, rather than
acquiring Giant unconditionally.


# 74041f5a 28-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

When validating that the length sum in recvit(), we fail to release
Giant on an error. Add a Giant acquisition.

Reviewed by: sam, bms


# 90ecfebd 16-Mar-2004 Alan Cox <alc@FreeBSD.org>

Refactor the existing machine-dependent sf_buf_free() into a machine-
dependent function by the same name and a machine-independent function,
sf_buf_mext(). Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical. Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page. That is now the purpose of
sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose emphemeral map cache.


# 0b759971 03-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Remove unneeded label 'done2' from socket(). We now grab Giant
only around socreate(), and don't need it for file descriptor
accesses.

Submitted by: sam


# b49d824e 08-Feb-2004 Mike Silbersack <silby@FreeBSD.org>

Add the SF_NODISKIO flag to sendfile. This flag causes sendfile to be
mindful of blocking on disk I/O and instead return EBUSY when such
blocking would occur.

Results from the DeBox project indicate that blocking on disk I/O
can slow the performance of a kqueue/poll based webserver. Using
a flag such as SF_NODISKIO and throwing connections that would block
to helper processes/threads helped increase performance.

Currently, only the Flash webserver uses this flag, although it could
probably be applied to thttpd with relative ease.

Idea by: Yaoping Ruan & Vivek Pai


# ff5e43a3 04-Feb-2004 Mike Silbersack <silby@FreeBSD.org>

Rename iov_to_uio to uiofromiov to be more consistent with other
uio* functions.

Suggested by: bde


# beb699c7 01-Feb-2004 Mike Silbersack <silby@FreeBSD.org>

Rewrite sendfile's header support so that headers are now sent in the first
packet along with data, instead of in their own packet. When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network. Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)

Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.


# 54556cc7 19-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

One more instance of magic number used in place of IO_SEQSHIFT.

Submitted by: alc


# a2fe44e8 15-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed. It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).


# d7a1c7e3 11-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Back out 1.166, which was committed by mistake.


# f1ea6d81 11-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Mechanical whitespace cleanup + other minor style nits.


# 012b5531 11-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Mechanical whitespace cleanup + minor style nits.


# d41457da 10-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

More unparenthesized return values.


# b91a5997 10-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Style: parenthesize return values.


# 2b77864f 10-Jan-2004 Don Lewis <truckman@FreeBSD.org>

Add a somewhat redundant check on the len arguement to getsockaddr() to
avoid relying on the minimum memory allocation size to avoid problems.
The check is somewhat redundant because the consumers of the returned
structure will check that sa_len is a protocol-specific larger size.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: nectar
MFC after: 30 days


# ddeb5b24 28-Dec-2003 Mike Silbersack <silby@FreeBSD.org>

Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait


# 93220782 25-Dec-2003 David Malone <dwmalone@FreeBSD.org>

In socket(2) we only need Giant around the call to socreate, so just
grab it there.


# 9f144cff 24-Dec-2003 Alfred Perlstein <alfred@FreeBSD.org>

Add restrict qualifiers.

PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>


# 186e347f 01-Dec-2003 David Greenman <dg@FreeBSD.org>

Fixed a bug in sendfile(2) where the sent data would be corrupted due
to sendfile(2) being erroneously automatically restarted after a signal
is delivered. Fixed by converting ERESTART to EINTR prior to exiting.

Updated manual page to indicate the potential EINTR error, its cause
and consequences.

Approved by: re@freebsd.org


# e45db9b8 15-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architecures
except as an opaque pointer that references a vm page.


# e1419c08 19-Oct-2003 David Malone <dwmalone@FreeBSD.org>

falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completly closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by: iedowse
Idea run past: jhb


# 411d10a6 29-Aug-2003 Alan Cox <alc@FreeBSD.org>

Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy
sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space. On an SMP, the emphemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.

Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.


# 660ebf0e 11-Aug-2003 Alexander Kabaev <kan@FreeBSD.org>

Drop Giant in recvit before returning an error to the caller to avoid
leaking the Giant on the syscall exit.


# b81694ed 06-Aug-2003 Yaroslav Tykhiy <ytykhiy@gmail.com>

If connect(2) has been interrupted by a signal and therefore the
connection is to be established asynchronously, behave as in the
case of non-blocking mode:

- keep the SS_ISCONNECTING bit set thus indicating that
the connection establishment is in progress, which is the case
(clearing the bit in this case was just a bug);

- return EALREADY, instead of the confusing and unreasonable
EADDRINUSE, upon further connect(2) attempts on this socket
until the connection is established (this also brings our
connect(2) into accord with IEEE Std 1003.1.)


# d2cce3d6 04-Aug-2003 David Malone <dwmalone@FreeBSD.org>

Do some minor Giant pushdown made possible by copyin, fget, fdrop,
malloc and mbuf allocation all not requiring Giant.

1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each indivitual recv* syscall to recvit.


# efd02757 01-Aug-2003 Alan Cox <alc@FreeBSD.org>

Use kmem_alloc_nofault() rather than kmem_alloc_pageable() in sf_buf_init().
(See revision 1.140 of kern/sys_pipe.c for a detailed rationale.)

Submitted by: tegge


# 8d5f9131 18-Jun-2003 Don Lewis <truckman@FreeBSD.org>

VOP_GETVOBJECT() wants to be called with the vnode lock held.


# c10c5378 11-Jun-2003 Alan Cox <alc@FreeBSD.org>

Finish the vm object locking in sendfile(2). More generally,
the vm locking in sendfile(2) is complete.


# 2ab3670a 11-Jun-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm object when removing a page.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# de1cab2b 29-May-2003 David Malone <dwmalone@FreeBSD.org>

Grab giant in sendit rather than kern_sendit because sockargs may
allocate mbufs with M_TRYWAIT, which may require Giant.

Reviewed by: bmilekic
Approved by: re (scottl)


# 710c5645 05-May-2003 David Malone <dwmalone@FreeBSD.org>

Split sendit into two parts. The first part, still called sendit, that
does the copyin stuff and then calls the second part kern_sendit to do
the hard work. Don't bother holding Giant during the copyin phase.

The intent of this is to allow the Linux emulator to impliment send*
syscalls without using the stackgap.


# 7be80f55 30-Mar-2003 Alan Cox <alc@FreeBSD.org>

Recent changes to uipc_cow.c have eliminated the need for some sf_buf-
related variables to be global. Make them either local to sf_buf_init() or
static.


# 9f6d45b1 28-Mar-2003 Alan Cox <alc@FreeBSD.org>

Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part
of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it.

The ultimate reason for this change is to enable two optimizations:
(1) that there never be more than one sf_buf mapping a vm_page at a time
and (2) 64-bit architectures can transparently use their 1-1 virtual
to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter()
and pmap_qremove().


# 42de97a5 16-Mar-2003 Alan Cox <alc@FreeBSD.org>

Pass the sf buf to MEXTADD() as the optional argument. This permits
the simplification of socow_iodone() and sf_buf_free(); they don't
have to reverse engineer the sf buf from the data's address.


# 7c4351aa 05-Mar-2003 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from sf_buf_free().


# 6a07a139 23-Feb-2003 Tor Egge <tegge@FreeBSD.org>

Sync new socket nonblocking/async state with file flags in accept().

PR: 1775
Reviewed by: mbr


# d6bf2378 19-Feb-2003 Olivier Houchard <cognet@FreeBSD.org>

Remove duplicate includes.

Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 12e4397e 03-Feb-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Break out the bind and connect syscalls to intend to make calling
these syscalls internally easy.
This is preparation for force coming IPv6 support for Linuxlator.

Submitted by: dwmalone
MFC after: 10 days


# 8deebb01 02-Feb-2003 Alfred Perlstein <alfred@FreeBSD.org>

Consolidate MIN/MAX macros into one place (param.h).

Submitted by: Hiten Pandya <hiten@unixdaemons.com>


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# 08c7670a 23-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move the declaration of the socket fileops from socketvar.h to file.h.
This allows us to use the new typedefs and removes the needs for a number
of forward struct declarations in socketvar.h


# b371c939 06-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Integrate mac_check_socket_send() and mac_check_socket_receive()
checks from the MAC tree: allow policies to perform access control
for the ability of a process to send and receive data via a socket.
At some point, we might also pass in additional address information
if an explicit address is requested on send.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 91e97a82 02-Oct-2002 Don Lewis <truckman@FreeBSD.org>

In an SMP environment post-Giant it is no longer safe to blindly
dereference the struct sigio pointer without any locking. Change
fgetown() to take a reference to the pointer instead of a copy of the
pointer and call SIGIO_LOCK() before copying the pointer and
dereferencing it.

Reviewed by: rwatson


# f2f03122 28-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

accept(2) on a socket that has been shutdown(2) normally returns
ECONNABORTED. Make this happen in the non-blocking case as well.
The previous behavior was to return EAGAIN, which (a) is not
consistent with the blocking case and (b) causes the application
to think the socket is still valid.

PR: bin/42100
Reviewed by: freebsd-net
MFC after: 3 days


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 4b9c2fa1 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Fix return case for negative namelen by jumping to normal exit processing
rather than immediately returning, or we may not unlock necessary locks.

Noticed by: Mike Heffner <mheffner@acm.vt.edu>


# 9e63574e 13-Aug-2002 David Greenman <dg@FreeBSD.org>

Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h
so that they can be seen by external callers.


# a370c700 13-Aug-2002 David Greenman <dg@FreeBSD.org>

Remove obsolete comment about sf_buf_* functions being static. They were
made un-static in rev 1.114.


# 87df4f8f 11-Aug-2002 Semen Ustimenko <semenu@FreeBSD.org>

Fix sendfile(), who was calling vn_rdwr() without aresid parameter and
thus hiting EIO at the end of file. This is believed to be a feature
(not a bug) of vn_rdwr(), so we turn it off by supplying aresid param.

Reviewed by: rwatson, dg


# 5b770403 08-Aug-2002 Jacques Vidrine <nectar@FreeBSD.org>

While we're at it, add range checks similar to those in previous commit to
getsockname() and getpeername(), too.


# 82d9ad33 08-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Add additional range checks for copyout targets.

Submitted by: Silvio Cesare <silvio@qualys.com>


# f9d0d524 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde


# 62f5f684 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument connect(), listen(), and bind() system calls to invoke
MAC framework entry points to permit policies to authorize these
requests. This can be useful for policies that want to limit
the activity of processes involving particular types of IPC and
network activity.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 1161b86a 30-Jul-2002 Alan Cox <alc@FreeBSD.org>

o In do_sendfile(), replace vm_page_sleep_busy() by vm_page_sleep_if_busy()
and extend the scope of the page queues lock to cover all accesses
to the page's flags and busy fields.


# 5d323204 22-Jul-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Make use of the VM_ALLOC_WIRED flag in the call to vm_page_alloc() in
do_sendfile(). This allows us to rearrange an if statement in order to
avoid doing an unnecesary call to vm_page_lock_queues(), and an attempt
at re-wiring the pages (which were wired in the vm_page_alloc() call).

Reviewed by: alc, jhb


# ae0ffa73 12-Jul-2002 Alan Cox <alc@FreeBSD.org>

Lock accesses to the page queues by sendfile() and friends.


# 9c341296 12-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the
fixed sendfile over. This is needed to preserve binary compatibility from
4.x to 5.x.


# a551e20e 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

nuke more instances of caddr_t


# 64f0b9d7 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

remove or replace caddr_t with void.
make the mbuf external free function take a void * rather than caddr_t.


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# c33c8251 20-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

Implement SO_NOSIGPIPE option for sockets. This allows one to request that
an EPIPE error return not generate SIGPIPE on sockets.

Submitted by: lioux
Inspired by: Darwin


# 60a9bb19 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

Catch up to changes in ktrace API.


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 89e9e6e7 19-Apr-2002 Robert Watson <rwatson@FreeBSD.org>

In sendfile(), use the vn_rdwr() helper function, rather than manually
constructing a struct aio and invoking VOP_READ() directly. This cleans
up the code a little, but also has the advantage of making sure almost
all vnode read/write access in the kernel goes through the helper
function, meaning that instrumentation of that helper function can impact
almost all relevant read/write operations. In this case, it permits us
to put MAC hooks into vn_rdwr() and not modify uipc_syscalls.c (yet).

In general, if helper vn_*() functions exist, they should be used in
preference to direct VOP's in system call service code.

Submitted by: green
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 70f52b48 23-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 7228268a 22-Jan-2002 David Greenman <dg@FreeBSD.org>

Fixed bug in calculation of amount of file to send when nbytes !=0 and
headers or trailers are supplied. Reported by Vladislav Shabanov
<vs@rambler-co.ru>.

PR: 33771
Submitted by: Maxim Konovalov <maxim@macomnet.ru>
MFC after: 3 days


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 078a4e89 08-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Sockets are called 'so' not 'sp'.


# 9c4d63da 31-Dec-2001 Robert Watson <rwatson@FreeBSD.org>

o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
argument.

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch


# b1e4abd2 16-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


# b064d43d 13-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

remove holdfp()

Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate

introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.

introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.

*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.

This continues the cleanup of struct filedesc and struct file access
routines which, when are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# df998760 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Giant pushdown syscalls in kern/uipc_syscalls.c. Affected calls:

recvmsg(), sendmsg(), recvfrom(), accept(), getpeername(), getsockname(),
socket(), connect(), accept(), send(), recv(), bind(), setsockopt(), listen(),
sendto(), shutdown(), socketpair(), sendfile()


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# db3cc2d0 23-Jun-2001 David Malone <dwmalone@FreeBSD.org>

Don't dereference a NULL pointer if we fail to get a sendfilebuf.


# 9d127f9f 25-May-2001 John Baldwin <jhb@FreeBSD.org>

Add vm locking to sendfile(2) and sf_buf_free().

Reported by: Tamiji Homma <thomma@BayNetworks.com>
Tested by: Tamiji Homma <thomma@BayNetworks.com>


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 06336fb2 25-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

Sendfile is documented to return 0 on success, however if when a
sf_hdtr is used to provide writev(2) style headers/trailers on the
sent data the return value is actually either the result of writev(2)
from the trailers or headers of no tailers are specified.

Fix sendfile to comply with the documentation, by returning 0 on
success.

Ok'd by: dg


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 4bde2ac5 08-Mar-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Fix is a similar race condition as existed in the mbuf code. When we go
into an interruptable sleep and we increment a sleep count, we make sure
that we are the thread that will decrement the count when we wakeup.
Otherwise, what happens is that if we get interrupted (signal) and we
have to wake up, but before we get our mutex, some thread that wants
to wake us up detects that the count is non-zero and so enters wakeup_one(),
but there's nothing on the sleep queue and so we don't get woken up. The
thread will still decrement the sleep count, which is bad because we will
also decrement it again later (as we got interrupted) and are already off
the sleep queue.


# 2239c07d 08-Mar-2001 David Malone <dwmalone@FreeBSD.org>

Make the wait for sendfile buffers interruptable. Stops one process
consuming them all and then getting stuck.

Reviewed by: dg
Reviewed by: bmilekic
Observed by: Andreas Persson <pap@garen.net>


# 19eb87d2 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Grab the process lock while calling psignal and before calling psignal.


# 2fd7d53d 13-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Return ECONNABORTED from accept if connection is closed while on the
listen queue, as well as the current behavior of a zero-length sockaddr.

Obtained from: KAME
Reviewed by: -net


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 1550c317 02-Jan-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Fix the <sys/queue.h> abuse.

Submitted by: Dima Dorfman <dima@unixfreak.org>
Reviewed by: /sbin/md5


# 7f9cb018 02-Jan-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Add an XXX about a <sys/queue.h> transgression which needs cleaned up.


# 2a0c503e 21-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 8f9a5273 02-Dec-2000 David Greenman <dg@FreeBSD.org>

Changed second argument in a call to sf_buf_free() to be NULL instead of
PAGE_SIZE to match the prototype better. The argument is ignored, so this
is just to silence the compile-time warning.

Pointed out by: jhb


# 794cd879 01-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

Make sure to free the sf_buf if we've allocated it but fail to allocate
an mbuf (ENOBUFS) before returning so that we don't leak sf_bufs in
the case where we're out of mbufs.

Submitted by: David Greenman (dg)


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 866746b6 12-Nov-2000 David Greenman <dg@FreeBSD.org>

Fixed a certain panic on IO error in sendfile(): Page must be set PG_BUSY
before calling vm_page_free() on it.


# e7789181 11-Nov-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Have m_pulldown() use the new M_WRITABLE() macro in order to determine
whether the given ext_buf is shared.

* Have the sf_bufs be setup with the mbuf subsystem using MEXTADD() with the
two new arguments.

Note: m_pulldown() is somewhat crotchy; the added comment explains the
situation.

Reviewed by: jlemon


# fe27eea9 04-Nov-2000 Bosko Milekic <bmilekic@FreeBSD.org>

Change the sf_bufs wakeups to be wakeup_one(), because we don't want to
wakeup all of the sleeping threads when we free only one buffer. This
avoids us having to needlessly try again (and fail, and go back to
sleep) for all the threads sleeping. We will now only wakeup the
thread we know will succeed.

Reviewed by: green


# 0eecc427 04-Nov-2000 Bosko Milekic <bmilekic@FreeBSD.org>

Setup and put to use the mutex lock for sf_freelist, the sendfile(2) bufs
freelist. Should now be thread-friendly, in part.

Note: More work is needed in uipc_syscalls.c, but it will have to wait until
the socket locking issues are at least 80% implemented and committed.


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# a5c4836d 19-Aug-2000 David Malone <dwmalone@FreeBSD.org>

Replace the mbuf external reference counting code with something
that should be better.

The old code counted references to mbuf clusters by using the offset
of the cluster from the start of memory allocated for mbufs and
clusters as an index into an array of chars, which did the reference
counting. If the external storage was not a cluster then reference
counting had to be done by the code using that external storage.

NetBSD's system of linked lists of mbufs was cosidered, but Alfred
felt it would have locking issues when the kernel was made more
SMP friendly.

The system implimented uses a pool of unions to track external
storage. The union contains an int for counting the references and
a pointer for forming a free list. The reference counts are
incremented and decremented atomically and so should be SMP friendly.
This system can track reference counts for any sort of external
storage.

Access to the reference counting stuff is now through macros defined
in mbuf.h, so it should be easier to make changes to the system in
the future.

The possibility of storing the reference count in one of the
referencing mbufs was considered, but was rejected 'cos it would
often leave extra mbufs allocated. Storing the reference count in
the cluster was also considered, but because the external storage
may not be a cluster this isn't an option.

The size of the pool of reference counters is available in the
stats provided by "netstat -m".

PR: 19866
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: alfred (glanced at by others on -net)


# 42ebfbf2 02-Jul-2000 Brian Feldman <green@FreeBSD.org>

Modify ktrace's general I/O tracing, ktrgenio(), to use a struct uio *
instead of a struct iovec * array and int len. Get rid of stupidly trying
to allocate all of the memory and copyin()ing the entire iovec[], and
instead just do the proper VOP_WRITE() in ktrwrite() using a copy of
the struct uio that the syscall originally used.

This solves the DoS which could easily be performed; to work around the
DoS, one could also remove "options KTRACE" from the kernel. This is
a very strong MFC candidate for 4.1.

Found by: art@OpenBSD.org


# 8757e5bb 12-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

unstatic getfp() so that other subsystems can use it.

make sendfile() use it.

Approved by: dg


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# f48b807f 11-Dec-1999 Brian Feldman <green@FreeBSD.org>

This is Bosko Milekic's mbuf allocation waiting code. Basically, this
means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.

Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: green, wollman


# 9b962c56 24-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

General clean-up of socket.h and associated sources to synchronise up
with NetBSD and the Single Unix Specification v2.

This updates some structures with other, almost equivalent types and
effort is under way to get the whole more consistent.

Also removes a double definition of INET6 and some other clean-ups.

Reviewed by: green, bde, phk
Some part obtained from: NetBSD, SUSv2 specification


# 2e3c8fcb 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# afce0034 13-Oct-1999 Brian Feldman <green@FreeBSD.org>

Add a missing spl lowering.

Submitted by: Ville-Pertti Keinonen <will@iki.fi>


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# bdf7fdcb 30-Sep-1999 Guido van Rooij <guido@FreeBSD.org>

Plug a potential filedescriptor leak. This will probably almost
never be triggered.

Reviewed by: David Greenman


# 13ccadd4 19-Sep-1999 Brian Feldman <green@FreeBSD.org>

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# e32c66c5 04-Aug-1999 Brian Feldman <green@FreeBSD.org>

Fix fd race conditions (during shared fd table usage.) Badfileops is
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.

This does not fix other problems mentioned in this PR than the first
one.

PR: 11629
Reviewed by: peter


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# ec42cbfc 25-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Don't free the socket address if soaccept() / pru_accept() doesn't
return one.


# 257aefa7 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Addendum: The original code that the last commit 'fixed' actually did
not have a bug in it, but the last commit did make it more readable so
we are keeping it.


# 89600e86 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

There was a situation where sendfile() might attempt to initiate I/O
on a PG_BUSY page, due to a bug in its sequencing of a conditional.


# 0069f505 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fixed a potential bug ( but maybe not ), where sendfile() clears PG_BUSY
on a page without testing for waiters. Also collapsed busy wait into
new vm_page_sleep_busy() inline ( see vm/vm_page.h )


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# f1d19042 07-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# 911e8dbc 02-Dec-1998 David Greenman <dg@FreeBSD.org>

Fixed broken code in sendfile(2) when using file offsets.


# 9d2b0909 22-Nov-1998 Don Lewis <truckman@FreeBSD.org>

We can't call fsetown() from sonewconn() because sonewconn() is be called
from an interrupt context and fsetown() wants to peek at curproc, call
malloc(..., M_WAITOK), and fiddle with various unprotected data structures.
The fix is to move the code that duplicates the F_SETOWN/FIOSETOWN state
of the original socket to the new socket from sonewconn() to accept1(),
since accept1() runs in the correct context. Deferring this until the
process calls accept() is harmless since the process can't do anything
useful with SIGIO on the new socket until it has the descriptor for that
socket.

One could make the case for not bothering to duplicate the
F_SETOWN/FIOSETOWN state and requiring the process to explicitly make the
fcntl() or ioctl() call on the new socket, but this would be incompatible
with the previous implementation and might break programs which rely on
the old semantics.

This bug was discovered by Andrew Gallatin <gallatin@cs.duke.edu>.


# 4f699173 18-Nov-1998 David Greenman <dg@FreeBSD.org>

Closed a very narrow and rare race condition that involved net interrupts,
bio interrupts, and a truncated file that along with the precise alignment
of the planets could result in a page being freed multiple times or a
just-freed page being put onto the inactive queue.


# efac52b4 15-Nov-1998 David Greenman <dg@FreeBSD.org>

In sendfile(2), check against sb_lowat when filling the socket buffer,
rather than 0.


# f2efb8e4 14-Nov-1998 David Greenman <dg@FreeBSD.org>

Fixed a couple of nits in sendfile(2): clear PG_ZERO before unbusying
the page, and use passed-in "p" rather than curproc in uio struct.


# bd81f199 06-Nov-1998 David Greenman <dg@FreeBSD.org>

Added support for non-blocking sockets to sendfile(2).


# dd0b2081 05-Nov-1998 David Greenman <dg@FreeBSD.org>

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.


# cfe8b629 22-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 2b605d08 10-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

64bit fixes: don't cast p->p_retval to an int*.


# 115facb2 14-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a minor mbuf leak created by the previous change.
Reviewed by: phk
Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)


# aba55893 11-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

setsockopt() transports user option data in an mbuf. if the user
data is greater than MLEN, setsockopt is unable to pass it onto
the protocol handler. Allocate a cluster in such case.

PR: 2575
Reviewed by: phk
Submitted by: Julian Assange proff@iq.org


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 5591b823d 16-Dec-1997 Eivind Eklund <eivind@FreeBSD.org>

Make COMPAT_43 and COMPAT_SUNOS new-style options.


# 0bec68bf 14-Dec-1997 Mike Smith <msmith@FreeBSD.org>

Consult sa_len before trampling it with MSG_COMPAT set.
PR: kern/5291
Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)


# 5af7db2b 13-Dec-1997 Mike Smith <msmith@FreeBSD.org>

As described by the submitter:

... fix a bug with orecvfrom() or recvfrom() called with
the MSG_COMPAT flag on kernels compiled with the COMPAT_43 option.
The symptom is that the fromaddr is not correctly returned.

This affects the Linux emulator.

Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# fa5cde12 17-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Delete a bit of debugging code that mistakenly crept in, and as a consequence
revert rev. 1.28's header file additions which are no longer needed.


# 19c0663e 17-Aug-1997 Tor Egge <tegge@FreeBSD.org>

Use KERNBASE, not 0xf0000000.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 9dd8309d 09-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Removed support for OLD_PIPE. <sys/stat.h> is now missing the hack that
supported nameless pipes being indistinguishable from fifos. We're not
going back.


# a91b8721 30-Mar-1997 David Greenman <dg@FreeBSD.org>

In accept1(), falloc() is called after the process has awoken, but prior
to removing the connection from the queue. The problem here is that
falloc() may block and this would allow another process to accept the
connection instead. If this happens to leave the queue empty, then the
system will panic with an "accept: nothing queued".

Also changed a wakeup() to a wakeup_one() to avoid the "thundering herd"
problem on new connections in Apache (or any other application that has
multiple processes blocked in accept() for the same socket).


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 67f7ea2d 15-Oct-1996 Garrett Wollman <wollman@FreeBSD.org>

Preserve file flags in accept(2).

Submitted by: fredriks@mcs.com in PR#1775 (this implmentaion is different)


# b12e5e82 23-Aug-1996 Peter Wemm <peter@FreeBSD.org>

The socketpair(0 syscall is bogusly returning the fd numbers through
the primary and secondary return codes, causing it to not behave as
documented. This probably originates from the ancient BSD kernels that
had pipe(2) implemented by socketpair(2), there are no binaries left that
we can run that do this.

Pointed out by: Robert Withrow <witr@rwwa.com>, PR#731


# 2c37256e 11-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


# 82dab6ce 09-May-1996 Garrett Wollman <wollman@FreeBSD.org>

Make it possible to return more than one piece of control information
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).

Submitted by: Louis Mamakos <loiue@TransSys.com>


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# be24e9e8 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Changed socket code to use 4.4BSD queue macros. This includes removing
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).


# 09bb5f75 24-Feb-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Make getsockopt() capable of handling more than one mbuf worth of data.
Use this to read rules out of ipfw.
Add the lkm code to ipfw.c


# dc915e7c 13-Feb-1996 Garrett Wollman <wollman@FreeBSD.org>

Kill XNS.
While we're at it, fix socreate() to take a process argument. (This
was supposed to get committed days ago...)


# f9827213 28-Jan-1996 John Dyson <dyson@FreeBSD.org>

Enable the new fast pipe code. The old pipes can be used with the
"OLD_PIPE" config option.


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# 6ee78bf0 01-Jan-1996 Peter Wemm <peter@FreeBSD.org>

Make pipe() return a set of bidirectional pipe fd's rather than one-way only
just like on SVR4.

This has no effect on any current programs in our source, but makes
the use of SVR4 code a little easier. There is no code or implementation
cost in the kernel.. This two-line change merely sets the modes on the ends
of the pipes to be bidirectional. There are no other changes.


# 47daf5d5 14-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Nuked ambiguous sleep message strings:
old: new:
netcls[] = "netcls" "soclos"
netcon[] = "netcon" "accept", "connec"
netio[] = "netio" "sblock", "sbwait"


# 5fdb8324 23-Oct-1995 Bruce Evans <bde@FreeBSD.org>

Simplify the pseudo-argument removal changes by not optimizing for
the !COMPAT_43 case - use a common function even when there is no
`old' function. The diffs for this are large because of code motion
to restore the function order to what it was before the pseudo-argument
changes.

Include <sys/sysproto.h> to get correct args structs and prototypes.
The diffs for this are large because the declarations of the args structs
were moved to become comments in the function headers. The comments may
actually match the automatically generated declarations right now.

Add prototypes.


# 88c94611 11-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove the '1' from getpeername1 and getsockname1 when NOT COMPAT_OLDSOCK.
Left it in there by mistake.


# 93c9414e 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove compat_43 psuedo-argument hack, and replace with a better hack.
Instead of using a fake "compat" argument, pass a real compat int to function
if COMPAT_43 is defined. Functions involved: wait4, accept, recvfrom,
getsockname.

With the compat psuedo-argument, this introduces an argument structure
that can have two possible sizes depending on compat options.
This makes life difficult for lkm modules like ibcs2, which would
have to guess what size used in kernel when compiled. Also,
the prototype generator for these structures cannot generate proper sizes.

Now there is only one fixed structure and makes everybody happy.

I recommend these changes be introduced to 2.1 so that ibcs2, linux
lkm's generated for 2.2 can still run on a 2.1 kernel.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources