History log of /freebsd-10.1-release/sys/sys/socketvar.h
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 272461 02-Oct-2014 gjb

Copy stable/10@r272459 to releng/10.1 as part of
the 10.1-RELEASE process.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255608 16-Sep-2013 kib

Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)


# 252356 28-Jun-2013 davide

- Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
switch.


# 250300 06-May-2013 andre

Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


# 249327 10-Apr-2013 glebius

Fix build.


# 247667 02-Mar-2013 pjd

- Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des


# 244022 08-Dec-2012 pjd

The socket_zone UMA zone is now private to uipc_socket.c.


# 241680 18-Oct-2012 attilio

Disconnect non-MPSAFE SMBFS from the build in preparation for dropping
GIANT from VFS. In addition, disconnect also netsmb, which is a base
requirement for SMBFS.

In the while SMBFS regular users can use FUSE interface and smbnetfs
port to work with their SMBFS partitions.

Also, there are ongoing efforts by vendor to support in-kernel smbfs,
so there are good chances that it will get relinked once properly locked.

This is not targeted for MFC.


# 215178 12-Nov-2010 luigi

This commit implements the SO_USER_COOKIE socket option, which lets
you tag a socket with an uint32_t value. The cookie can then be
used by the kernel for various purposes, e.g. setting the skipto
rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has
been implemented; however there is nothing ipfw-specific in its
implementation).

The ipfw-related code that uses the optopn will be committed separately.

This change adds a field to 'struct socket', but the struct is not
part of any driver or userland-visible ABI so the change should be
harmless.

See the discussion at
http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html

Idea and code from Paul Joe, small modifications and manpage
changes by myself.

Submitted by: Paul Joe
MFC after: 1 week


# 212822 18-Sep-2010 rwatson

With reworking of the socket life cycle in 7.x, the need for a "sotryfree()"
was eliminated: all references to sockets are explicitly managed by sorele()
and the protocols. As such, garbage collect sotryfree(), and update
sofree() comments to make the new world order more clear.

MFC after: 3 days
Reported by: Anuranjan Shukla <anshukla at juniper dot net>


# 210226 18-Jul-2010 trasz

Revert r210225 - turns out I was wrong; the "/*-" is not license-only
thing; it's also used to indicate that the comment should not be automatically
rewrapped.

Explained by: cperciva@


# 210225 18-Jul-2010 trasz

The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright
occurences from sys/sys/ and sys/kern/.


# 195423 07-Jul-2009 kib

Fix poll(2) and select(2) for named pipes to return "ready for read"
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.

Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.

PR: kern/94772
Submitted by: bde (original version)
Reviewed by: rwatson, jilles
Approved by: re (kensmith)
MFC after: 6 weeks (might be)


# 194672 22-Jun-2009 andre

Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
the length of data to be moved into the uio compared to
soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams. Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets. Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4: r164817 //depot/user/andre/soreceive_stream/


# 193272 01-Jun-2009 jhb

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# 191688 30-Apr-2009 zec

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# 181066 31-Jul-2008 kmacy

move sockbuf locking macros in to sockbuf.h


# 180948 29-Jul-2008 kmacy

Factor sockbuf, sockopt, and sockstate out of socketvar.h in to separate headers.

Reviewed by: rwatson
MFC after: 3 days


# 180198 02-Jul-2008 rwatson

Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps


# 179027 15-May-2008 gnn

Update the kernel to count the number of mbufs and clusters
(all types) used per socket buffer.

Add support to netstat to print out all of the socket buffer
statistics.

Update the netstat manual page to describe the new -x flag
which gives the extended output.

Reviewed by: rwatson, julian


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 175968 04-Feb-2008 rwatson

Further clean up sorflush:

- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.

MFC after: 3 weeks


# 175845 31-Jan-2008 rwatson

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>


# 174711 17-Dec-2007 kmacy

Add SB_NOCOALESCE flag to disable socket buffer update in place


# 174647 16-Dec-2007 jeff

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selecters on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescaned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# 169236 03-May-2007 rwatson

sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.


# 167717 19-Mar-2007 andre

Space to tab in SB_* defines to match with rest of file.


# 167715 19-Mar-2007 andre

Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() make use of it.

Tested by: gallatin


# 166404 01-Feb-2007 andre

Generic socket buffer auto sizing support, header defines, flag inheritance.

MFC after: 1 month


# 160875 01-Aug-2006 rwatson

Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc. Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer. A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.


# 160619 24-Jul-2006 rwatson

soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman


# 160605 23-Jul-2006 rwatson

Tweak so_gencnt comment: it was once last, but that is no longer the
case.


# 160604 23-Jul-2006 rwatson

Tweak comment for so_head: it is a pointer to the listen socket, rather
than the accept socket.


# 159706 17-Jun-2006 rwatson

Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do. No current protocols
invoke these functions, and removing them will help discourage them
from ever being used. I should have removed them years ago, but
lost track of it.

MFC after: 1 week
Prodded almost by accident by: peter


# 159481 10-Jun-2006 rwatson

Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month


# 157360 01-Apr-2006 rwatson

Add a comment describing SS_PROTOREF in detail. This will eventually be
in socket(9).

MFC after: 3 months


# 156763 16-Mar-2006 rwatson

Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.


# 156746 15-Mar-2006 rwatson

Correct spelling of 0x4000 in previous commit. This one line change from
a 42k patch seemed easier to retype than apply, but apparently not. :-)

Submitted by: pjd


# 156737 15-Mar-2006 rwatson

Add SS_PROTOREF socket flag, which represents a strong reference by the
protocol to the socket. Normally protocol references are weak: that is,
the socket layer can tear down the socket (and hence protocol state)
when it finds convenient. This flag will allow the protocol to
explicitly declare to the socket layer that it is maintaining a
strong reference, rather than the current implicit model associated
with so_pcb pointer values and repeated attempts to possibly free the
socket.


# 154294 13-Jan-2006 rwatson

Add sosend_dgram(), a greatly reduced and simplified version of sosend()
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin(). Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.


# 151967 02-Nov-2005 andre

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 151888 30-Oct-2005 rwatson

Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.


# 147852 09-Jul-2005 jhb

Document that SOCK_LOCK is used to protect so_emuldata.

Approved by: re (scottl)


# 143463 12-Mar-2005 rwatson

Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option
from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it
now matches do_setopt_accept_filter(). Slightly reformulate the logic to
match the optimistic allocation of storage for the argument in advance,
and slightly expand the coverage of the socket lock.


# 142190 21-Feb-2005 rwatson

In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn


# 142058 18-Feb-2005 rwatson

Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where
the rest of the accept filter code currently lives.

MFC after: 3 days


# 141052 30-Jan-2005 glebius

Move sb_state to the beginning of structure, above sb_startzero member.
sb_state shouldn't be erased, when socket buffer is flushed by sorflush().

When sb_state was bzero'ed, a recently set SBS_CANTRCVMORE flag was cleared.
If a socket was shutdown(SHUT_RD), a subsequent read() would block on it.

Reported by: Ed Maste, Gerrit Nagelhout
Reviewed by: rwatson


# 140730 24-Jan-2005 glebius

- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX.

Before this change setting somaxconn to smth above 32767 and calling
listen(fd, -1) lead to a socket, which doesn't accept connections at all.

Reviewed by: rwatson
Reported by: Igor Sysoev


# 136682 18-Oct-2004 rwatson

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# 136326 09-Oct-2004 rwatson

Add SOCKBUF_UNLOCK_ASSERT(), which asserts that the current thread does
not hold the mutex for a socket buffer.


# 132060 12-Jul-2004 dwmalone

Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by: alfred


# 132018 12-Jul-2004 alfred

Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama


# 131168 27-Jun-2004 rwatson

Annotate so_gencnt field of struct socket locked by so_global_mtx.


# 131029 24-Jun-2004 rwatson

Annotate so_error as being used for simple assignment and reads, and
therefore not locked.

Assert the socket buffer lock in sowwakeup_locked() to match
sorwakeup_locked().


# 131022 24-Jun-2004 rwatson

Annotate which SB_ constants are for sb_flags fields.


# 131017 24-Jun-2004 rwatson

Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations


# 131006 23-Jun-2004 rwatson

Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.


# 130831 20-Jun-2004 rwatson

Merge next step in socket buffer locking:

- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().

- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 130818 20-Jun-2004 rwatson

Annotate so_state as locked with SOCK_LOCK(so).

Add a commenting indicating that the SB_ constants apply to sb_flags.


# 130527 15-Jun-2004 rwatson

Fill in locking annotation for additional socket fields:

so_timeo Used as a sleep/wakeup address, no locking.
sb_* Almost all socket buffer fields locked with
sockbuf lock for the oskcet buffer.
so_cred Static after socket creation.


# 130492 14-Jun-2004 rwatson

Remove unneeded '-' from comment header; this comment contains only
English text paragraphs that shouldn't have problems when run through
indent.


# 130480 14-Jun-2004 rwatson

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 130398 13-Jun-2004 rwatson

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


# 130393 12-Jun-2004 rwatson

Move #ifdef _KERNEL higher in socketvar.h to cover various socket
buffer related macros.


# 130387 12-Jun-2004 rwatson

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 130383 12-Jun-2004 rwatson

Whitespace-only restyling of socket reference count macros.


# 130380 12-Jun-2004 rwatson

Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer. Add accessor macros to use the
mutex (SOCKBUF_*()). Initialize the mutex in soalloc(), and destroy
it in sodealloc(). Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by: sam
Sponosored by: FreeBSD Foundation
Obtained from: BSD/OS


# 130364 11-Jun-2004 rwatson

Use tabs instead of spaces between #define and macro name; a merge
mistake as they are in rwatson_netperf.


# 130050 04-Jun-2004 rwatson

Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.


# 129979 02-Jun-2004 rwatson

Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.


# 129958 01-Jun-2004 rwatson

Replace current locking comments for struct socket/struct sockbuf
with new ones. Annotate constant-after-creation fields as such. The
comments describe a number of locks that are not yet merged.


# 129916 01-Jun-2004 rwatson

The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.


# 127976 07-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 126425 01-Mar-2004 rwatson

Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by: sam


# 126411 29-Feb-2004 rwatson

Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by: sam


# 122780 16-Nov-2003 alc

- Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architecures
except as an opaque pointer that references a vm page.


# 122524 12-Nov-2003 rwatson

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 121628 28-Oct-2003 sam

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# 118453 04-Aug-2003 hsu

Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


# 117708 17-Jul-2003 robert

To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it. knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR: kern/54331


# 112778 29-Mar-2003 alc

Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part
of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it.

The ultimate reason for this change is to enable two optimizations:
(1) that there never be more than one sf_buf mapping a vm_page at a time
and (2) 64-bit architectures can transparently use their 1-1 virtual
to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter()
and pmap_qremove().


# 111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109098 11-Jan-2003 tjr

Don't count mbufs with m_type == MT_HEADER or MT_OOBDATA as control data
in sballoc(), sbcompress(), sbdrop() and sbfree(). Fixes fstat() st_size
reporting and kevent() EVFILT_READ on TCP sockets.


# 108242 23-Dec-2002 phk

Move the declaration of the socket fileops from socketvar.h to file.h.
This allows us to use the new typedefs and removes the needs for a number
of forward struct declarations in socketvar.h


# 108235 23-Dec-2002 phk

s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.


# 106326 02-Nov-2002 alc

Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner


# 106313 01-Nov-2002 kbyanc

Track the number of non-data chararacters stored in socket buffers so that
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.

PR: 30634
Reviewed by: -net, -arch
Sponsored by: NTT Multimedia Communications Labs
MFC after: 2 weeks


# 102003 17-Aug-2002 rwatson

In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101983 16-Aug-2002 rwatson

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101848 13-Aug-2002 rwatson

Move to a nested include of _label.h instead of mac.h in sys/sys/*.h
(Most of the places where mac.h was recursively included from another
kernel header file. net/netinet to follow.)

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Suggested by: bde


# 101842 13-Aug-2002 dg

Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h
so that they can be seen by external callers.


# 100989 30-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible kernel
access control.

Label socket IPC objects, permitting security features to be maintained
at the granularity of the socket. Two labels are stored for each
socket: the label of the socket itself, and a cached peer label
permitting interogation of the remote endpoint. Since socket locking
is not yet present in the base tree, these objects are not locked,
but are assumed to follow the same semantics as other modifiable
entries in the socket structure.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100764 27-Jul-2002 rwatson

Remote socheckproc(), which was removed when p_can*() was introduced
ages ago. The prototype was missed.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100591 24-Jul-2002 jdp

Widen struct sockbuf's sb_timeo member to int from short. With
non-default but reasonable values of hz this member overflowed,
breaking NFS over UDP.

Also, as long as I'm plowing up struct sockbuf ... Change certain
members from u_long/long to u_int/int in order to reduce wasted
space on 64-bit machines. This change was requested by Andrew
Gallatin.

Netstat and systat need to be rebuilt. I am incrementing
__FreeBSD_version in case any ports need to change.


# 99009 28-Jun-2002 alfred

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 98993 28-Jun-2002 alfred

change struct socket -> so_pcb from caddr_t to void *.


# 98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 98385 18-Jun-2002 tanimura

Remove so*_locked(), which were backed out by mistake.


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 95959 02-May-2002 alfred

Cleanup, quote:

This leaves some vestiges of the old locking, including style
bugs in it. I've only noticed anachronisms in socketvar.h so far
(I've merged net* but not kern or all of sys). The patch also
has old fixes for style bugs in accf stuff and namespace pollution
in uma... The largest style bugs are line continued backslashes
in column 80 and (these are fixed), and starting the do-while
code for the new macros in column 40, which is quite unlike the
usual indentation (see sys/queue.h) and not even like the indentation
for the old macros (column 32) (this is not fixed).

Submitted by: bde


# 95883 01-May-2002 alfred

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 95759 29-Apr-2002 tanimura

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# 95552 27-Apr-2002 tanimura

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 95343 24-Apr-2002 silby

Remove sodropablereq - this function hasn't been used since the
syncache went in.

MFC after: 3 days


# 94160 08-Apr-2002 hsu

There's only one socket zone so we don't need to remember it
in every socket structure.


# 93219 26-Mar-2002 bde

Removed some namespace pollution (unnecessary nested includes).


# 93008 23-Mar-2002 bde

Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.


# 92754 20-Mar-2002 jeff

Backout part of my previous commit; I was wrong about vm_zone's handling of
limits on zones w/o objects.


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92719 19-Mar-2002 alfred

Remove __P


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 89318 13-Jan-2002 alfred

Add parens around macro args.

Forgotten by: dillon


# 89111 09-Jan-2002 alfred

holdsock is gone, remove the prototype


# 88739 31-Dec-2001 rwatson

o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
argument.

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch


# 87822 13-Dec-2001 green

Remove stale prototype for sonewconn3().


# 86487 17-Nov-2001 dillon

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


# 85453 25-Oct-2001 rwatson

o Remove extern showallsockets, defunct as of the change to
kern.security.seeotheruids_permitted. This was missed in the
commit that made this change elsewhere.


# 84527 05-Oct-2001 ps

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


# 83421 13-Sep-2001 obrien

Re-apply rev 1.178 -- style(9) the structure definitions.
I have to wonder how many other changes were lost in the KSE mildstone 2 merge.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83045 04-Sep-2001 obrien

style(9) the structure definitions.


# 82656 31-Aug-2001 jlemon

Whitespace change.


# 78913 28-Jun-2001 jlemon

Correct comment: so_q -> so_comp, so_q0 -> so_incomp.

Submitted by: Adagio Vangogh <adagio_v@pacbell.net>


# 72521 15-Feb-2001 jlemon

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


# 70834 09-Jan-2001 wollman

select() DKI is now in <sys/selinfo.h>.


# 70536 31-Dec-2000 phk

Use macro API to <sys/queue.h>


# 68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 65534 06-Sep-2000 alfred

Accept filter maintainance

Update copyrights.

Introduce a new sysctl node:
net.inet.accf

Although acceptfilters need refcounting to be properly (safely) unloaded
as a temporary hack allow them to be unloaded if the sysctl
net.inet.accf.unloadable is set, this is really for developers who want
to work on thier own filters.

A near complete re-write of the accf_http filter:
1) Parse check if the request is HTTP/1.0 or HTTP/1.1 if not dump
to the application.
Because of the performance implications of this there is a sysctl
'net.inet.accf.http.parsehttpversion' that when set to non-zero
parses the HTTP version.
The default is to parse the version.
2) Check if a socket has filled and dump to the listener
3) optimize the way that mbuf boundries are handled using some voodoo
4) even though you'd expect accept filters to only be used on TCP
connections that don't use m_nextpkt I've fixed the accept filter
for socket connections that use this.

This rewrite of accf_http should allow someone to use them and maintain
full HTTP compliance as long as net.inet.accf.http.parsehttpversion is
set.


# 64837 19-Aug-2000 dwmalone

Replace the mbuf external reference counting code with something
that should be better.

The old code counted references to mbuf clusters by using the offset
of the cluster from the start of memory allocated for mbufs and
clusters as an index into an array of chars, which did the reference
counting. If the external storage was not a cluster then reference
counting had to be done by the code using that external storage.

NetBSD's system of linked lists of mbufs was cosidered, but Alfred
felt it would have locking issues when the kernel was made more
SMP friendly.

The system implimented uses a pool of unions to track external
storage. The union contains an int for counting the references and
a pointer for forming a free list. The reference counts are
incremented and decremented atomically and so should be SMP friendly.
This system can track reference counts for any sort of external
storage.

Access to the reference counting stuff is now through macros defined
in mbuf.h, so it should be easier to make changes to the system in
the future.

The possibility of storing the reference count in one of the
referencing mbufs was considered, but was rejected 'cos it would
often leave extra mbufs allocated. Storing the reference count in
the cluster was also considered, but because the external storage
may not be a cluster this isn't an option.

The size of the pool of reference counters is available in the
stats provided by "netstat -m".

PR: 19866
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: alfred (glanced at by others on -net)


# 61837 19-Jun-2000 alfred

return of the accept filter part II

accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 55943 14-Jan-2000 jasone

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 55205 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 53541 22-Nov-1999 shin

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 52984 08-Nov-1999 peter

Update socket file type for fo_stat(). soo_stat() becomes a fileops
switch entry point rather than being used externally with knowledge of the
internals of the DTYPE_SOCKET f_data contents.


# 52070 09-Oct-1999 green

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# 51418 19-Sep-1999 green

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# 51381 19-Sep-1999 green

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 47992 17-Jun-1999 green

Reviewed by: the cast of thousands

This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.


# 45311 04-Apr-1999 dt

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# 43512 01-Feb-1999 newton

Moved prototypes for soo_{read,write,close} into socketvar.h where they
belong.

Suggested by: bde


# 43458 31-Jan-1999 bde

Fixed smashed tabs and inconstent comment style in previous commit.


# 43409 30-Jan-1999 newton

Changed struct socket to include a new field (at the end, so as not
to break existing software) acting as a pointer to emulator-specific
state data that some emulators may (or may not) need to maintain about
a socket.

Used by the svr4 module as a place for maintaining STREAMS emulation state.

Discussed with: Mike Smith, Garrett Wollman back in Sept 98


# 43196 25-Jan-1999 fenner

Port NetBSD's 19990120-accept bug fix. This works around the race condition
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.

Reviewed by: wollman
Obtained from: NetBSD
(ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)


# 41087 11-Nov-1998 truckman

I got another batch of suggestions for cosmetic changes from bde.


# 41086 11-Nov-1998 truckman

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 40931 05-Nov-1998 dg

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.


# 38482 23-Aug-1998 wollman

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36527 31-May-1998 peter

Have the sorwakeup and sowwakeup check the upcall flags.

Obtained from: NetBSD


# 36079 15-May-1998 wollman

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 33955 01-Mar-1998 guido

Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD


# 32995 01-Feb-1998 bde

Forward declare more structs that are used in prototypes here - don't
depend on <sys/types.h> forward declaring common ones.


# 31927 21-Dec-1997 bde

Moved some declarations from <sys/socket.h> to the correct places, and
fixed everything that depended on them being misplaced.


# 29350 14-Sep-1997 peter

Update interfaces for poll()


# 28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 27531 19-Jul-1997 fenner

Remove sonewconn() macro kludge, introduced in 4.3-Reno to catch argument
mismatches. Prototypes do a much better job these days.

Noticed by: bde


# 25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 19670 12-Nov-1996 bde

Added missing prototype for new function sbcreatecontrol().

Should be in 2.2.


# 18787 07-Oct-1996 pst

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


# 15492 30-Apr-1996 bde

Made this self-sufficent (apart from <sys/types.h>) again. It included
<sys/stat.h> and <sys/filedesc.h> just to get struct tags and depended
on a previous #include for <sys/queue.h>


# 14547 11-Mar-1996 dg

Changed socket code to use 4.4BSD queue macros. This includes removing
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).


# 14505 11-Mar-1996 hsu

Merge in Lite2: clean up function prototypes.
Did not accept change of second argument to ioctl from int to u_long.
Reviewed by: davidg & bde


# 14093 13-Feb-1996 wollman

Kill XNS.
While we're at it, fix socreate() to take a process argument. (This
was supposed to get committed days ago...)


# 13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# 12843 14-Dec-1995 bde

Nuked ambiguous sleep message strings:
old: new:
netcls[] = "netcls" "soclos"
netcon[] = "netcon" "accept", "connec"
netio[] = "netio" "sblock", "sbwait"


# 12819 14-Dec-1995 phk

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 12453 21-Nov-1995 bde

Completed function declarations and/or added prototypes.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 3304 02-Oct-1994 phk

Prototypes, prototypes and even more prototypes. Not quite done yet, but
getting closer all the time.


# 2165 21-Aug-1994 paul

Made them all idempotent.
Reviewed by:
Submitted by:


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources