History log of /freebsd-current/sys/rpc/svc_vc.c
Revision Date Author Comments
# e205fd31 09-Apr-2024 Gleb Smirnoff <glebius@FreeBSD.org>

rpc: use new macros to lock socket buffers

Fixes: d80a97def9a1db6f07f5d2e68f7ad62b27918947


# eb8ba6fb 04-Jan-2024 Assume-Zhan <assume0701@gapp.nthu.edu.tw>

rpc: Fix typo in comment

Event: Advanced UNIX Programming Course (Fall’23) at NTHU.
Pull Request: https://github.com/freebsd/freebsd-src/pull/995


# 0fac350c 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on getpeername/getsockname

Just like it was done for accept(2) in cfb1e92912b4, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.

Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 1a878807 02-Nov-2023 Rick Macklem <rmacklem@FreeBSD.org>

krpc: Display stats of TLS usage

This patch adds some sysctls:
kern.rpc.unenc.tx_msgcnt
kern.rpc.unenc.tx_msgbytes
kern.rpc.unenc.rx_msgcnt
kern.rpc.unenc.rx_msgbytes
kern.rpc.tls.tx_msgcnt
kern.rpc.tls.tx_msgbytes
kern.rpc.tls.rx_msgcnt
kern.rpc.tls.rx_msgbytes
kern.rpc.tls.handshake_success
kern.rpc.tls.handshake_failed
kern.rpc.tls.alerts
which allow a NFS server sysadmin to determine how much
NFS-over-TLS is being used. A large number of failed
handshakes might also indicate an NFS confirguration
problem.

This patch moves the definition of "kern.rpc" from the
kgssapi module to the krpc module. As such, both modules
need to be rebuilt from sources. Since __FreeBSD_version
was bumped yesterday, I will not bump it again.

Suggested by: gwollman
Discussed on: freebsd-current
MFC after: 1 month


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 564ed8e8 22-Aug-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Allow multiple instances of rpc.tlsservd

During a discussion with someone working on NFS-over-TLS
for a non-FreeBSD platform, we agreed that a single server
daemon for TLS handshakes could become a bottleneck when
an NFS server first boots, if many concurrent NFS-over-TLS
connections are attempted.

This patch modifies the kernel RPC code so that it can
handle multiple rpc.tlsservd daemons. A separate commit
currently under review as D35886 for the rpc.tlsservd
daemon.


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 87d18efe 24-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Adjust svc_vc_null() definition to avoid clang 15 warning

With clang 15, the following -Werror warning is produced:

sys/rpc/svc_vc.c:1078:12: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
svc_vc_null()
^
void

This is because svc_vc_null() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.

MFC after: 3 days


# 0b4f2ab0 15-May-2022 Rick Macklem <rmacklem@FreeBSD.org>

krpc: Fix NFS-over-TLS for KTLS1.3

When NFS-over-TLS uses KTLS1.3, the client can receive
post-handshake handshake records. These records can be
safely thown away, but are not handled correctly via the
rpctls_ct_handlerecord() upcall to the daemon.

Commit 373511338d95 changed soreceive_generic() so that it
will only return ENXIO for Alert records when MSG_TLSAPPDATA
is specified. As such, the post-handshake handshake
records will be returned to the krpc.

This patch modifies the krpc so that it will throw
these records away, which seems sufficient to make
NFS-over-TLS work with KTLS1.3. This change has
no effect on the use of KTLS1.2, since it does not
generate post-handshake handshake records.

MFC after: 2 weeks


# 6e671ec1 04-Apr-2022 Warner Losh <imp@FreeBSD.org>

svc_vc_rendezvous_stat: eliminiate write only variable stat

Sponsored by: Netflix


# e3ba94d4 09-Nov-2021 John Baldwin <jhb@FreeBSD.org>

Don't require the socket lock for sorele().

Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741


# 7fabaac2 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

rpc: Convert an SOLISTENING check to an assertion

Per the comment, this socket should always be a listening socket.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 20d728b5 09-Jul-2021 Mark Johnston <markj@FreeBSD.org>

rpc: Make function tables const

No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# f4bb1869 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOLISTENING() macro

Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e1a907a2 11-Jun-2021 Rick Macklem <rmacklem@FreeBSD.org>

krpc: Acquire ref count of CLIENT for backchannel use

Michael Dexter <editor@callfortesting.org> reported
a crash in FreeNAS, where the first argument to
clnt_bck_svccall() was no longer valid.
This argument is a pointer to the callback CLIENT
structure, which is free'd when the associated
NFSv4 ClientID is free'd.

This appears to have occurred because a callback
reply was still in the socket receive queue when
the CLIENT structure was free'd.

This patch acquires a reference count on the CLIENT
that is not CLNT_RELEASE()'d until the socket structure
is destroyed. This should guarantee that the CLIENT
structure is still valid when clnt_bck_svccall() is called.
It also adds a check for closed or closing to
clnt_bck_svccall() so that it will not process the callback
RPC reply message after the ClientID is free'd.

Comments by: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30153


# ab0c29af 21-Aug-2020 Rick Macklem <rmacklem@FreeBSD.org>

Add TLS support to the kernel RPC.

An internet draft titled "Towards Remote Procedure Call Encryption By Default"
describes how TLS is to be used for Sun RPC, with NFS as an intended use case.
This patch adds client and server support for this to the kernel RPC,
using KERN_TLS and upcalls to daemons for the handshake, peer reset and
other non-application data record cases.

The upcalls to the daemons use three fields to uniquely identify the
TCP connection. They are the time.tv_sec, time.tv_usec of the connection
establshment, plus a 64bit sequence number. The time fields avoid problems
with re-use of the sequence number after a daemon restart.
For the server side, once a Null RPC with AUTH_TLS is received, kernel
reception on the socket is blocked and an upcall to the rpctlssd(8) daemon
is done to perform the TLS handshake. Upon completion, the completion
status of the handshake is stored in xp_tls as flag bits and the reply to
the Null RPC is sent.
For the client, if CLSET_TLS has been set, a new TCP connection will
send the Null RPC with AUTH_TLS to initiate the handshake. The client
kernel RPC code will then block kernel I/O on the socket and do an upcall
to the rpctlscd(8) daemon to perform the handshake.
If the upcall is successful, ct_rcvstate will be maintained to indicate
if/when an upcall is being done.

If non-application data records are received, the code does an upcall to
the appropriate daemon, which will do a SSL_read() of 0 length to handle
the record(s).

When the socket is being shut down, upcalls are done to the daemons, so
that they can perform SSL_shutdown() calls to perform the "peer reset".

The rpctlssd(8) and rpctlscd(8) daemons require a patched version of the
openssl library and, as such, will not be committed to head at this time.

Although the changes done by this patch are fairly numerous, there should
be no semantics change to the kernel RPC at this time.
A future commit to the NFS code will optionally enable use of TLS for NFS.


# 91898857 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Avoid relying on header pollution from sys/refcount.h.

MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 779f106a 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770


# 1b537781 27-May-2016 Enji Cooper <ngie@FreeBSD.org>

Quell false positives in svc_vc_create and svc_vc_create_conn with cd and xprt

Both cd and xprt will be non-NULL after their respective malloc(9) wrappers are
called (mem_alloc and svc_xprt_alloc, which calls mem_alloc) as mem_alloc
always gets called with M_WAITOK|M_ZERO today. Thus, testing for them being
non-NULL is incorrect -- it misleads Coverity and it misleads the reader.

Remove some unnecessary NULL initializations as a follow up to help solidify
the fact that these pointers will be initialized properly in sys/rpc/.. with
the interfaces the way they are currently.

Differential Revision: https://reviews.freebsd.org/D6572
MFC after: 2 weeks
Reported by: Coverity
CID: 1007338, 1007339, 1007340
Reviewed by: markj, truckman
Sponsored by: EMC / Isilon Storage Division


# 7f5b1253 14-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

RPC: for pointers replace 0 with NULL.

These are mostly cosmetical, no functional change.

Found with devel/coccinelle.


# 2c98c61d 18-Aug-2015 Xin LI <delphij@FreeBSD.org>

Set curvnet context inside the RPC code in more places.

Reviewed by: melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3398


# b4c02146 28-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove useless acquire semantic from the atomic_add operation before
sosend(). The only release on the xp_snt_cnt is done after sosend(),
with an intent to synchronize with load_acq in svc_vc_ack().

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 80867e61 07-Apr-2015 Alexander Motin <mav@FreeBSD.org>

Remove hard limits on number of accepting NFS connections.

Limits of 5 connections set long ago creates problems for SPEC benchmark.
Make the NFS follow system-wide maximum.

MFC after: 1 week


# 84a9ba84 02-Feb-2015 Pedro F. Giffuni <pfg@FreeBSD.org>

rpc: Uninitialized pointer read

Initialize *xprt to avoid exposing a random value
in cleanup_svc_vc_create.
This is the kernel counterpart of r278041.

CID: 1007340


# cfa6009e 12-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# c59e4cc3 01-Jul-2014 Rick Macklem <rmacklem@FreeBSD.org>

Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after: 1 month


# c809a67a 04-Jan-2014 Alexander Motin <mav@FreeBSD.org>

Replace locks added in r260229 to protect sequence counters with atomics.

New algorithm does not create additional lock congestion, while some races
it includes should not be a problem. Those races may keep requests in DRC
cache for some more time by returning ACK position smaller then actual,
but it still should be able to drop thems when proper ACK finally read.

Races of the original algorithm based on TCP seq number were worse because
they happened when reply sequence number were recorded. After that even
correctly read ACKs could not clean DRC sometimes.


# d473bac7 03-Jan-2014 Alexander Motin <mav@FreeBSD.org>

Rework NFS Duplicate Request Cache cleanup logic.

- Introduce additional hash to group requests by hash of sockref. This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
- Indroduce additional callbacks to notify application layer about sockets
disconnection. Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
- Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache. This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
- Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.

Together this limits NFS DRC growth only to situations of real connectivity
problems. If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load. Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.

As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by: iXsystems, Inc.


# 5c42b9dc 29-Dec-2013 Alexander Motin <mav@FreeBSD.org>

Introduce xprt_inactive_self() -- variant for use when sure that port
is assigned to thread. For example, withing receive handlers. In that
case the function reduces to single assignment and can avoid locking.


# 4a240f6c 28-Dec-2013 Alexander Motin <mav@FreeBSD.org>

In addition to r259632 completely block receive upcalls if we have more
data than we need. This reduces lock pressure from xprt_active() side.


# 679659ad 24-Dec-2013 Alexander Motin <mav@FreeBSD.org>

Fix a bug introduced at r259632, triggering infinite loop in some cases.


# 7455eb71 19-Dec-2013 Alexander Motin <mav@FreeBSD.org>

Rework flow control for connection-oriented (TCP) RPC server.

When processing receive buffer, write the amount of data, expected
in present request record, into socket's so_rcv.sb_lowat to make stack
aware about our needs. When processing following upcalls, ignore them
until socket collect enough data to be read and processed in one turn.
This change reduces number of context switches and other operations
in RPC stack during large NFS writes (especially via non-Jumbo networks)
by order of magnitude.

After precessing current packet, take another look into the pending
buffer to find out whether the next packet had been already received.
If not, deactivate this port right there without making RPC code to
push this port to another thread just to find that there is nothing.
If the next packet is received partially, also deactivate the port, but
also update socket's so_rcv.sb_lowat to not be woken up prematurely.
This change additionally reduces number of context switches per NFS
request about in half.


# 2e322d37 25-Nov-2013 Hiroki Sato <hrs@FreeBSD.org>

Replace Sun RPC license in TI-RPC library with a 3-clause BSD license,
with the explicit permission of Sun Microsystems in 2009.


# dad14216 08-Apr-2013 John Baldwin <jhb@FreeBSD.org>

Fix a potential socket leak in the NFS server. If a client closes its
connection after it was accepted by the userland nfsd process but before
it was handled off to svc_vc_create() in the kernel, then svc_vc_create()
would see it as a new listen socket and try to listen on it leaving a
dangling reference to the socket. Instead, check for disconnected sockets
and treat them like a connected socket. The call to pru_getaddr() should
fail and cause svc_vc_create() to fail. Note that we need to lock the
socket to get a consistent snapshot of so_state since there is a window
in soisdisconnected() where both flags are clear.

Reviewed by: dfr, rmacklem
MFC after: 1 week


# bd54830b 11-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Use m_get(), m_gethdr() and m_getcl() instead of historic macros.

Sponsored by: Nginx, Inc.


# e2adc47d 07-Dec-2012 Rick Macklem <rmacklem@FreeBSD.org>

Add support for backchannels to the kernel RPC. Backchannels
are used by NFSv4.1 for callbacks. A backchannel is a connection
established by the client, but used for RPCs done by the server
on the client (callbacks). As a result, this patch mixes some
client side calls in the server side and vice versa. Some
definitions in the .c files were extracted out into a file called
krpc.h, so that they could be included in multiple .c files.
This code has been in projects/nfsv4.1-client for some time.
Although no one has given it a formal review, I believe kib@
has taken a look at it.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 1fb51a12 16-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


# 2a1e0fb4 10-Jan-2011 Rick Macklem <rmacklem@FreeBSD.org>

Fix a bug in the client side krpc where it was, sometimes
erroneously, assumed that 4 bytes of data were in the first
mbuf of a list by replacing the bcopy() with m_copydata().
Also, replace the uses of m_pullup(), which can fail for
reasons other than not enough data, with m_copydata().
For the cases where it isn't known that there is enough
data in the mbuf list, check first via m_len and m_length().
This is believed to fix a problem reported by dpd at dpdtech.com
and george+freebsd at m5p.com.

Reviewed by: jhb
MFC after: 8 days


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 83864c81 28-Aug-2009 Marko Zec <zec@FreeBSD.org>

MFC r196503:

Fix NFS panics with options VIMAGE kernels by apropriately setting curvnet
context inside the RPC code.

Temporarily set td's cred to mount's cred before calling socreate() via
__rpc_nconf2socket().

Submitted by: rmacklem (in part)
Reviewed by: rmacklem, rwatson
Discussed with: dfr, bz
Approved by: re (rwatson), julian (mentor)

Approved by: re (rwatson)


# 0348c661 24-Aug-2009 Marko Zec <zec@FreeBSD.org>

Fix NFS panics with options VIMAGE kernels by apropriately setting curvnet
context inside the RPC code.

Temporarily set td's cred to mount's cred before calling socreate() via
__rpc_nconf2socket().

Submitted by: rmacklem (in part)
Reviewed by: rmacklem, rwatson
Discussed with: dfr, bz
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days


# 6b97c9f0 17-Jun-2009 Rick Macklem <rmacklem@FreeBSD.org>

Since svc_[dg|vc|tli|tp]_create() did not hold a reference count on the
SVCXPTR structure returned by them, it was possible for the structure
to be free'd before svc_reg() had been completed using the structure.
This patch acquires a reference count on the newly created structure
that is returned by svc_[dg|vc|tli|tp]_create(). It also
adds the appropriate SVC_RELEASE() calls to the callers, except the
experimental nfs subsystem. The latter will be committed separately.

Submitted by: dfr
Tested by: pho
Approved by: kib (mentor)


# 0da4382a 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Correct MAC compile problems resulting from the new RPC code copying and
pasting code from the general socket code without also bringing along
required opt_mac.h includes.


# a4fa5e6d 04-Jun-2009 Rick Macklem <rmacklem@FreeBSD.org>

Fix two races in the server side krpc w.r.t upcalls:
Add a flag so that soupcall_clear() is only called once to cancel
an upcall.
Move the test for xprt_registered in the upcall down to after the
mtx_lock() of the pool mutex, to catch the case where it is
unregistered while the upcall is waiting for the mutex.
Also, move the mtx_destroy() of the pool mutex to after SVC_RELEASE(),
so that it isn't destroyed before the upcalls are disabled.

Reviewed by: dfr, jhb
Tested by: pho
Approved by: kib (mentor)


# f93bfb23 02-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from: TrustedBSD Project


# 74fb0ba7 01-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# a9148abd 03-Nov-2008 Doug Rabson <dfr@FreeBSD.org>

Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by: Isilon Systems
MFC after: 1 month


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# c675522f 26-Jun-2008 Doug Rabson <dfr@FreeBSD.org>

Re-implement the client side of rpc.lockd in the kernel. This implementation
provides the correct semantics for flock(2) style locks which are used by the
lockf(1) command line tool and the pidfile(3) library. It also implements
recovery from server restarts and ensures that dirty cache blocks are written
to the server before obtaining locks (allowing multiple clients to use file
locking to safely share data).

Sponsored by: Isilon Systems
PR: 94256
MFC after: 2 weeks


# ee31b83a 28-Mar-2008 Doug Rabson <dfr@FreeBSD.org>

Minor changes to improve compatibility with older FreeBSD releases.


# dfdcada3 26-Mar-2008 Doug Rabson <dfr@FreeBSD.org>

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks