History log of /freebsd-current/sys/fs/nfsserver/nfs_nfsdstate.c
Revision Date Author Comments
# e2c9fad2 04-Jun-2024 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix delegation handled for atomic upgrade

For NFSv4.1/4.2, an atomic upgrade of a delegation from a
read delegation to a write delegation is allowed and can
result in signoficantly improved performance.

This patch adds support for this atomic upgrade, plus fixes
a couple of other delegation related bugs. Since there were
three cases where delegations were being issued, the patch
factors this out into a separate function called
nfsrv_issuedelegations().

This patch should only affect the NFSv4.1/4.2 behaviour
when delegations are enabled, which is not the default.

MFC after: 1 month


# 54c3aa02 25-Apr-2024 Rick Macklem <rmacklem@FreeBSD.org>

Revert "nfsd: Fix NFSv4.1/4.2 Claim_Deleg_Cur_FH"

This reverts commit f300335d9aebf2e99862bf783978bd44ede23550.

It turns out that the old code was correct and it was wireshark
that was broken and indicated that the RPC's XDR was bogus.
Found during IETF bakeathon testing this week.


# f300335d 19-Oct-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix NFSv4.1/4.2 Claim_Deleg_Cur_FH

When I implemented a test patch using Open Claim_Deleg_Cur_FH
I discovered that the NFSv4.1/4.2 server was broken for this
Open option. Fortunately it is never used by the FreeBSD
client and never used by other clients unless delegations
are enabled. (The FreeBSD NFSv4 server does not have delegations
enabled by default.)

Claim_Deleg_Cur_FH was broken because the code mistakenly
assumed a stateID argument, which is not the case.
This patch fixes the bug by changing the XDR parser to not
expect a stateID and to fill most of the stateID in from the
clientID. The clientID is the first two elements of the "other"
array for the stateID and is sufficient to identify which
client the delegation is issued to. Since there is only one
delegation issued to a client per file, this is sufficient to
locate the correct delegation.

If you are running non-FreeBSD NFSv4.1/4.2 mounts against the
FreeBSD server, you need this patch if you have delegations enabled.

PR: 274574
MFC after: 2 weeks


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 11892bc7 02-Aug-2023 Gordon Bergling <gbe@FreeBSD.org>

nfsserver: Fix a typo in a source code comment

- s/restared/restarted/

MFC after: 3 days


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# ff2f1f69 07-Apr-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Add support for the SP4_MACH_CRED case in ExchangeID

Commit f4179ad46fa4 added support for operation bitmaps for
NFSv4.1/4.2. This commit uses those to implement the SP4_MACH_CRED
case for the NFSv4.1/4.2 ExchangeID operation since the Linux
NFSv4.1/4.2 client is now using this for Kerberized mounts.
The Linux Kerberized NFSv4.1/4.2 mounts currently work without
support for this because Linux will fall back to SP4_NONE,
but there is no guarantee this fallback will work forever.

This commit only affects Kerberized NFSv4.1/4.2 mounts from
Linux at this time.

MFC after: 3 months


# 695d87ba 28-Mar-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfscl: Make coverity happy

Coverity does not like code that checks a function's
return value sometimes. Add "(void)" in front of the
function when the return value does not matter to try
and make it happy.

A recent commit deleted "(void)"s in front of nfsm_fhtom().
This commit puts them back in.

Reported by: emaste
MFC after: 3 months


# 896516e5 16-Mar-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfscl: Add a new NFSv4.1/4.2 mount option for Kerberized mounts

Without this patch, a Kerberized NFSv4.1/4.2 mount must provide
a Kerberos credential for the client at mount time. This credential
is typically referred to as a "machine credential". It can be
created one of two ways:
- The user (usually root) has a valid TGT at the time the mount
is done and this becomes the machine credential.
There are two problems with this.
1 - The user doing the mount must have a valid TGT for a user
principal at mount time. As such, the mount cannot be put
in fstab(5) or similar.
2 - When the TGT expires, the mount breaks.
- The client machine has a service principal in its default keytab
file and this service principal (typically called a host-based
initiator credential) is used as the machine credential.
There are problems with this approach as well:
1 - There is a certain amount of administrative overhead creating
the service principal for the NFS client, creating a keytab
entry for this principal and then copying the keytab entry
into the client's default keytab file via some secure means.
2 - The NFS client must have a fixed, well known, DNS name, since
that FQDN is in the service principal name as the instance.

This patch uses a feature of NFSv4.1/4.2 called SP4_NONE, which
allows the state maintenance operations to be performed by any
authentication mechanism, to do these operations via AUTH_SYS
instead of RPCSEC_GSS (Kerberos). As such, neither of the above
mechanisms is needed.

It is hoped that this option will encourage adoption of Kerberized
NFS mounts using TLS, to provide a more secure NFS mount.

This new NFSv4.1/4.2 mount option, called "syskrb5" must be used
with "sec=krb5[ip]" to avoid the need for either of the above
Kerberos setups to be done by the client.

Note that all file access/modification operations still require
users on the NFS client to have a valid TGT recognized by the
NFSv4.1/4.2 server. As such, this option allows, at most, a
malicious client to do some sort of DOS attack.

Although not required, use of "tls" with this new option is
encouraged, since it provides on-the-wire encryption plus,
optionally, client identity verification via a X.509
certificate provided to the server during TLS handshake.
Alternately, "sec=krb5p" does provide on-the-wire
encryption of file data.

A mount_nfs(8) man page update will be done in a separate commit.

Discussed on: freebsd-current@
MFC after: 3 months


# b039ca07 15-Feb-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Wrap nfsstatsv1_p in the NFSD_VNET() macro

Commit 7344856e3a6d added a lot of macros that will front end
vnet macros so that nfsd(8) can run in vnet prison.
The nfsstatsv1_p variable got missed. This patch wraps all
uses of nfsstatsv1_p with the NFSD_VNET() macro.
The NFSD_VNET() macro is still a null macro.

MFC after: 3 months


# 7e44856e 11-Feb-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Prepare the NFS server code to run in a vnet prison

This patch defines null macros that can be used to apply
the vnet macros for global variables and SYSCTL flags.
It also applies these macros to many of the global variables
and some of the SYSCTLs. Since the macros do nothing, these
changes should not result in semantics changes, although the
changes are large in number.

The patch does change several global variables that were
arrays or structures to pointers to same. For these variables,
modified initialization and cleanup code malloc's and free's
the arrays/structures. This was done so that the vnet footprint
would be about 300bytes when the macros are defined as vnet macros,
allowing nfsd.ko to load dynamically.

I believe the comments in D37519 have been addressed, although
it has never been reviewed, due in part to the large size of the patch.
This is the first of a series of patches that will put D37519 in main.

Once everything is in main, the macros will be defined as front
end macros to the vnet ones.

MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D37519


# aa41036e 22-Jun-2022 Elliott Mitchell <ehem+freebsd@m5p.com>

nfsserver: purge EOL release compatibility

Remove now-obsolete FreeBSD 4.x and earlier ifdef.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/603
Differential Revision: https://reviews.freebsd.org/D35560


# 5a0050e6 15-Jan-2023 Rick Macklem <rmacklem@FreeBSD.org>

nfsserver: Fix handling of SP4_NONE

For NFSv4.1/4.2, when the client specifies SP4_NONE for
state protection in the ExchangeID operation arguments,
the server MUST allow the state management operations for
any user credentials. (I misread the RFC and thought that
SP4_NONE meant "at the server's discression" and not MUST
be allowed.)

This means that the "sec=XXX" field of the "V4:" exports(5)
line only applies to NFSv4.0.

This patch fixes the server to always allow state management
operations for SP4_NONE, which is the only state management
option currently supported. (I have patches that add support
for SP4_MACH_CRED to the server. These will be in a future commit.)

In practice, this bug does not seem to have caused
interoperability problems.

MFC after: 2 weeks


# a75d1ddd 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce V_PCATCH to stop abusing PCATCH


# b875d4f5 27-Aug-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Update console message for no session found

The NFSv4.1/4.2 server generates a console message that indicates
that there is no session. I was until recently perplexed w.r.t. how
this could occur. It turns out that the common cause is multiple NFS
clients with the same /etc/hostid.

The host uuid is used by the FreeBSD NFSv4.1/4.2 client as a unique
identifier for the client. If multiple clients use the same host uuid,
this indicates to the NFSv4.1/4.2 server that they are the same client
and confusion occurs.

This trivial patch modifies the console message to suggest that the
client's /etc/hostid needs to be checked for uniqueness.

Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D36377


# 088ba435 13-Jul-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix CreateSession for an established ClientID

I mis-read the RFC w.r.t. handling of the sequenceid
when a CreateSession is done after the initial one
that confirms the ClientID. Fortunately this does
not affect most extant NFSv4.1/4.2 clients, since
they only acquire a single session for TCP for a
ClientID (Solaris might be an exception?).

This patch fixes the server to handle this case,
where the RFC requires the sequenceid be incremented
for each CreateSession and is required to reply to
a retried CreateSession with a cached reply.
It adds a field to nfsclient called lc_prevsess,
which caches the sessionid, which is the only field
in a CreateSession reply that will change for a
retry, to implement this reply cache.

The recent commits up to d4a11b3e3bdd that mark
session slots bad when "intr" and/or "soft" mounts
are used by the client needs this server patch.
Without this patch, the client will do a full
recovery, including a new ClientID, losing all
byte range locks. However, prior to the recent
client commits, the client would hang when all
session slots were bad, so even without this
patch it is not a regression.

PR: 260011
MFC after: 2 weeks


# 40ada74e 09-Jul-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfscl: Add optional support for slots marked bad

This patch adds support for session slots marked bad
to nfsv4_sequencelookup(). An additional boolean
argument indicates if the check for slots marked bad
should be done.

The "cred" argument added to nfscl_reqstart() by
commit 326bcf9394c7 is now passed into nfsv4_setquence()
so that it can optionally set the boolean argument
for nfsv4_sequencelookup(). When optionally enabled,
nfsv4_setsequence() will do a DestroySession when all
slots are marked bad.

Since the code that marks slots bad is not yet committed,
this patch should not result in a semantics change.

PR: 260011
MFC after: 2 weeks


# 5d3fe02c 22-Jun-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Clean up the code by not using the vnode_vtype() macro

The vnode_vtype() macro was used to make the code compatible
with Mac OSX, for the Mac OSX port.
For FreeBSD, this macro just obscured the code, so
avoid using it to clean up the code.

This commit should not result in a semantics change.


# 271f6d52 02-May-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix session slot freeing for NFSv4.1/4.2

Without this patch the NFSv4.1/4.2 server erroneously
always frees session slot zero for callbacks. This only
affects 4.1/4.2 mounts if the server has delegations
enabled or is a pNFS configuration. Even for those
cases, the effect is mainly to only use slot 0 for
callbacks, serializing all of them. There is a slight
chance that callbacks will fail if the client performs
them in a different order than received on the TCP
connection.

If this bug affects your server, you will see console
messages like:
newnfs_request: Bad session slot

This patch fixes the problem. Found during a recent
IETF NFSv4 testing event.

PR: 263728
MFC after: 2 weeks


# 17a56f3f 09-Feb-2022 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

The ESXi NFSv4.1 client bogusly sends the wrong value
for the csa_sequence argument for a Create_session operation.
RFC8881 requires this value to be the same as the sequence
reply from the ExchangeID operation most recently done for
the client ID.

Without this patch, the server replies NFSERR_STALECLIENTID,
which is the correct response for an NFSv4.0 SetClientIDConfirm
but is not the correct error for NFSv4.1/4.2, which is
specified as NFSERR_SEQMISORDERED in RFC8881.
This patch fixes this.

This change does not fix the issue reported in the PR, where
the ESXi client loops, attempting ExchangeID/Create_session
repeatedly.

Reported by: asomers
Tested by: asomers
PR: 261291
MFC after: 1 week


# c302f889 13-Dec-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Limit parsing of layout errors to maxcnt bytes

This patch decrements maxcnt by the appropriate
number of bytes during parsing and checks to see
if there is data remaining. If not, it just returns
from nfsrv_flexlayouterr() without further processing.
This prevents the tl pointer from running off the end
of the error data pointed at by layp, if there are
flaws in the data.

Reported by: rtm@lcs.mit.edu
Tested by: rtm@lcs.mit.edu
PR: 260293
MFC after: 2 weeks


# bdd57cbb 26-Nov-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Add checks for layout errors in LayoutReturn

For a LayoutReturn when using the Flexible File Layout,
error reports may be provided in the request.
Sanity check the size of these error reports and
check that they exist before calling nfsrv_flexlayouterr().

Reported by: rtm@lcs.mit.edu
Tested by: rtm@lcs.mit.edu
PR: 260012
MFC after: 2 weeks


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# ce9676de 12-Nov-2021 Rick Macklem <rmacklem@FreeBSD.org>

pNFS: Add nfsstats counters for number of Layouts

For pNFS, Layouts are issued by the server to indicate
where a file's data resides on the DS(s). This patch
adds counters for how many layouts are allocated to
the nfsstatsv1 structure, using two reserved fields.

MFC after: 2 weeks


# f8dc0630 08-Nov-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix the NFSv4.2 pNFS MDS server for NFSERR_NOSPC via LayoutError

If a pNFS server's DS runs out of disk space, it replies
NFSERR_NOSPC to the client doing writing. For the Linux
client, it then sends a LayoutError RPC to the MDS server to
tell it about the error and keeps retrying, doing repeated
LayoutGets to the MDS and Write RPCs to the DS. The Linux client is
"stuck" until disk space on the DS is free'd up unless
a subsequent LayoutGet request is sent a NFSERR_NOSPC
reply.
The looping problem still occurs for NFSv4.1 mounts, but no
fix for this is known at this time.

This patch changes the pNFS MDS server to reply to LayoutGet
operations with NFSERR_NOSPC once a LayoutError reports the
problem, until the DS has available space. This keeps the Linux
NFSv4.2 from looping.

Found during recent testing because of issues w.r.t. a DS
being out of space found during a recent IEFT NFSv4 working
group testing event.

MFC after: 2 weeks


# a7e014ee 07-Nov-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Fix the NFSv4 pNFS MDS server for DS NFSERR_NOSPC

If a pNFS server's DS runs out of disk space, it replies
NFSERR_NOSPC to the client doing writing. For the Linux
client, it then sends a LayoutError RPC to the server to
tell it about the error and keeps retrying, doing repeated
LayoutGet and Write RPCs to the DS. The Linux client is
"stuck" until disk space on the DS is free'd up.
For a mirrored server configuration, the first mirror that
ran out of space was taken offline. This does not make
much sense, since the other mirror(s) will run out of space
soon and the fix is a manual cleanup up disk space.

This patch changes the pNFS server to not disable a mirror
for the mirrored case when this occurs.

Further work is needed, since the Linux client expects the
MDS to reply NFSERR_NOSPC to LayoutGets once the DS is out
of space. Without this further change, the above mentioned
looping occurs.

Found during a recent IEFT NFSv4 working group testing event.

MFC after: 2 weeks


# ee29e6f3 16-Jul-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: Add sysctl to set maximum I/O size up to 1Mbyte

Since MAXPHYS now allows the FreeBSD NFS client
to do 1Mbyte I/O operations, add a sysctl called vfs.nfsd.srvmaxio
so that the maximum NFS server I/O size can be set up to 1Mbyte.
The Linux NFS client can also do 1Mbyte I/O operations.

The default of 128Kbytes for the maximum I/O size has
not been changed for two reasons:
- kern.ipc.maxsockbuf must be increased to support 1Mbyte I/O
- The limited benchmarking I can do actually shows a drop in I/O rate
when the I/O size is above 256Kbytes.
However, daveb@spectralogic.com reports seeing an increase
in I/O rate for the 1Mbyte I/O size vs 128Kbytes using a Linux client.

Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30826


# 1e0a518d 08-Jul-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfscl: Add a Linux compatible "nconnect" mount option

Linux has had an "nconnect" NFS mount option for some time.
It specifies that N (up to 16) TCP connections are to created for a mount,
instead of just one TCP connection.

A discussion on freebsd-net@ indicated that this could improve
client<-->server network bandwidth, if either the client or server
have one of the following:
- multiple network ports aggregated to-gether with lagg/lacp.
- a fast NIC that is using multiple queues
It does result in using more IP port#s and might increase server
peak load for a client.

One difference from the Linux implementation is that this implementation
uses the first TCP connection for all RPCs composed of small messages
and uses the additional TCP connections for RPCs that normally have
large messages (Read/Readdir/Write). The Linux implementation spreads
all RPCs across all TCP connections in a round robin fashion, whereas
this implementation spreads Read/Readdir/Write across the additional
TCP connections in a round robin fashion.

Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30970


# e1a907a2 11-Jun-2021 Rick Macklem <rmacklem@FreeBSD.org>

krpc: Acquire ref count of CLIENT for backchannel use

Michael Dexter <editor@callfortesting.org> reported
a crash in FreeNAS, where the first argument to
clnt_bck_svccall() was no longer valid.
This argument is a pointer to the callback CLIENT
structure, which is free'd when the associated
NFSv4 ClientID is free'd.

This appears to have occurred because a callback
reply was still in the socket receive queue when
the CLIENT structure was free'd.

This patch acquires a reference count on the CLIENT
that is not CLNT_RELEASE()'d until the socket structure
is destroyed. This should guarantee that the CLIENT
structure is still valid when clnt_bck_svccall() is called.
It also adds a check for closed or closing to
clnt_bck_svccall() so that it will not process the callback
RPC reply message after the ClientID is free'd.

Comments by: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30153


# 46269d66 16-May-2021 Rick Macklem <rmacklem@FreeBSD.org>

NFSv4 server: Re-establish the delegation recall timeout

Commit 7a606f280a3e allowed the server to do retries of CB_RECALL
callbacks every couple of seconds. This was needed to allow the
Linux client to re-establish the back channel.
However this patch broke the delegation timeout check, such that
it would just keep retrying CB_RECALLS.
If the client has crashed or been network patitioned from the
server, this continues until the client TCP reconnects to
the server and re-establishes the back channel.

This patch modifies the code such that it still times out the
delegation recall after some minutes, so that the server will
allow the conflicting client request once the delegation times out.

This patch only affects the NFSv4 server when delegations are
enabled and a NFSv4 client that holds a delegation has crashed
or been network partitioned from the server for at least several
minutes when a delegation needs to be recalled.

MFC after: 2 weeks


# 87597731 26-Apr-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: fix the slot sequence# when a callback fails

Commit 4281bfec3628 patched the server so that the
callback session slot would be free'd for reuse when
a callback attempt fails.
However, this can often result in the sequence# for
the session slot to be advanced such that the client
end will reply NFSERR_SEQMISORDERED.

To avoid the NFSERR_SEQMISORDERED client reply,
this patch negates the sequence# advance for the
case where the callback has failed.
The common case is a failed back channel, where
the callback cannot be sent to the client, and
not advancing the sequence# is correct for this
case. For the uncommon case where the client's
reply to the callback is lost, not advancing the
sequence# will indicate to the client that the
next callback is a retry and not a new callback.
But, since the FreeBSD server always sets "csa_cachethis"
false in the callback sequence operation, a retry
and a new callback should be handled the same way
by the client, so this should not matter.

Until you have this patch in your NFSv4.1/4.2 server,
you should consider avoiding the use of delegations.
Even with this patch, interoperation with the
Linux NFSv4.1/4.2 client in kernel versions prior
to 5.3 can result in frequent 15second delays if
delegations are enabled. This occurs because, for
kernels prior to 5.3, the Linux client does a TCP
reconnect every time it sees multiple concurrent
callbacks and then it takes 15seconds to recover
the back channel after doing so.

MFC after: 2 weeks


# 4281bfec 23-Apr-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: fix session slot handling for failed callbacks

When the NFSv4.1/4.2 server does a callback to a client
on the back channel, it will use a session slot in the
back channel session. If the back channel has failed,
the callback will fail and, without this patch, the
session slot will not be released.
As more callbacks are attempted, all session slots
can become busy and then the nfsd thread gets stuck
waiting for a back channel session slot.

This patch frees the session slot upon callback
failure to avoid this problem.

Without this patch, the problem can be avoided by leaving
delegations disabled in the NFS server.

MFC after: 2 weeks


# 5a89498d 19-Apr-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: fix stripe size reply for the File Layout pNFS server

At a recent testing event I found out that I had misinterpreted
RFC5661 where it describes the stripe size in the File Layout's
nfl_util field. This patch fixes the pNFS File Layout server
so that it returns the correct value to the NFSv4.1/4.2 pNFS
enabled client.

This affects almost no one, since pNFS server configurations
are rare and the extant pNFS aware NFS clients seemed to
function correctly despite the erroneous stripe size.
It *might* be needed for correct behaviour if a recent
Linux client mounts a FreeBSD pNFS server configuration
that is using File Layout (non-mirrored configuration).

MFC after: 2 weeks


# 7a606f28 04-Apr-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: make the server repeat CB_RECALL every couple of seconds

Commit 01ae8969a9ee stopped the NFSv4.1/4.2 server from implicitly
binding the back channel to a new TCP connection so that it
conforms to RFC5661, for NFSv4.1/4.2. An effect of this
for the Linux NFS client is that it will do a
BindConnectionToSession when it sees NFSV4SEQ_CBPATHDOWN
set in a sequence reply. This will fix the back channel, but the
first attempt at a callback like CB_RECALL will already have
failed. Without this patch, a CB_RECALL will not be retried
and that can result in a 5 minute delay until the delegation
times out.

This patch modifies the code so that it will retry the
CB_RECALL every couple of seconds, often avoiding the
5 minute delay.

This is not critical for correct behaviour, but avoids
the 5 minute delay for the case where the Linux client
re-binds the back channel via BindConnectionToSession.

MFC after: 2 weeks


# 6f2addd8 04-Apr-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: fix BindConnectionToSession so that it clears "cb path down"

Commit 01ae8969a9ee stopped the NFSv4.1/4.2 server from implicitly
binding the back channel to a new TCP connection so that it
conforms to RFC5661, for NFSv4.1/4.2. An effect of this
for the Linux NFS client is that it will do a
BindConnectionToSession when it sees NFSV4SEQ_CBPATHDOWN
set in a sequence reply. It will do this for every RPC
reply until it no longer sees the flag.
Without that patch, this will happen until the client does
an Open, which will clear LCL_CBDOWN.

This patch clears LCL_CBDOWN right away, so that
NFSV4SEQ_CBPATHDOWN will no longer be sent to the client
in Sequence replies and the Linux client will not repeat
the BindConnectionToSession RPCs.

This is not critical for correct behaviour, but reduces
RPC overheads for cases where the Open will not be done
for a while.

MFC after: 2 weeks


# 01ae8969 30-Mar-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: do not implicitly bind the back channel for NFSv4.1/4.2 mounts

The NFSv4.1 (and 4.2 on 13) server incorrectly binds
a new TCP connection to the back channel when first
used by an RPC with a Sequence op in it (almost all of them).
RFC5661 specifies that only the fore channel should be bound.

This was done because early clients (including FreeBSD)
did not do the required BindConnectionToSession RPC.

Unfortunately, this breaks the Linux client when the
"nconnects" mount option is used, since the server
may do a callback on the incorrect TCP connection.

This patch converts the server behaviour to that
required by the RFC. It also makes the server test/indicate
failure of the back channel more aggressively.

Until this patch is applied to the server, the
"nconnects" mount option is not recommended for a Linux
NFSv4.1/4.2 client mount to the FreeBSD server.

Reported by: bcodding@redhat.com
Tested by: bcodding@redhat.com
PR: 254560
MFC after: 1 week


# dc78533a 01-Jan-2021 Rick Macklem <rmacklem@FreeBSD.org>

nfsd: fix NFSv4.0 seqid handling for ERELOOKUP

Commit 774a36851e0e fixed the NFS server so that it could handle
ERELOOKUP returns from VOP calls by redoing the operation/RPC.
However, for NFSv4.0, redoing an Open would increment
the open_owner's seqid multiple times, breaking the protocol.
This patch sets a new flag called ND_ERELOOKUP on the RPC when
a redo is in progress. Then the code that increments the seqid
avoids the seqid increment/check when the flag is set, since
it indicates this has already been done for the Open.


# ff45b9fc 26-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Bjorn reported a problem where the Linux NFSv4.1 client is
using an open_to_lock_owner4 when that lock_owner4 has already
been created by a previous open_to_lock_owner4. This caused the NFS server
to reply NFSERR_INVAL.

For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
what error to return).

For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
but the NFSv4.1 server can handle this case without error.
This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
uses of the same lock_owner in open_to_lock_owner so that it now correctly
interoperates with the Linux NFS client.
It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.

Thanks go to Bjorn for diagnosing this and testing the patch.
He also provided a program that I could use to reproduce the problem.

Tested by: bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
PR: 249567
Reported by: bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
MFC after: 3 days


# 58dd2b52 18-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix a LOR between the NFS server and server side krpc.

Recent testing of the NFS-over-TLS code found a LOR between the mutex lock
used for sessions and the sleep lock used for server side krpc socket
structures in nfsrv_checksequence(). This was fixed by r365789.
A similar bug exists in nfsrv_bindconnsess(), where SVC_RELEASE() is called
while mutexes are held.
This patch applies a fix similar to r365789, moving the SVC_RELEASE() call
down to after the mutexes are released.

This patch fixes the problem by moving the SVC_RELEASE() call in
nfsrv_checksequence() down a few lines to below where the mutex is released.

MFC after: 1 week


# a5c55410 15-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix a LOR between the NFS server and server side krpc.

Recent testing of the NFS-over-TLS code found a LOR between the mutex lock
used for sessions and the sleep lock used for server side krpc socket
structures.
The code in nfsrv_checksequence() would call SVC_RELEASE() with the mutex
held. Normally this is ok, since all that happens is SVC_RELEASE()
decrements a reference count. However, if the socket has just been shut
down, SVC_RELEASE() drops the reference count to 0 and acquires a sleep
lock during destruction of the server side krpc structure.

This patch fixes the problem by moving the SVC_RELEASE() call in
nfsrv_checksequence() down a few lines to below where the mutex is released.

MFC after: 1 week


# 2848d6d4 13-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix a case where the NFSv4.0 server might crash if delegations are enabled.

asomers@ reported a crash on an NFSv4.0 server with a backtrace of:
kdb_backtrace
vpanic
panic
nfsrv_docallback
nfsrv_checkgetattr
nfsrvd_getattr
nfsrvd_dorpc
nfssvc_program
svc_run_internal
svc_thread_start
fork_exit
fork_trampoline
where the panic message was "docallb", which indicates that a callback
was attempted when the ClientID is unconfirmed.
This would not normally occur, but it is possible to have an unconfirmed
ClientID structure with delegation structure(s) chained off it if the
client were to issue a SetClientID with the same "id" but different
"verifier" after acquiring delegations on the previously confirmed ClientID.

The bug appears to be that nfsrv_checkgetattr() failed to check for
this uncommon case of an unconfirmed ClientID with a delegation structure
that no longer refers to a delegation the client knows about.

This patch adds a check for this case, handling it as if no delegation
exists, which is the case when the above occurs.
Although difficult to reproduce, this change should avoid the panic().

PR: 249127
Reported by: asomers
Reviewed by: asomers
MFC after: 1 week
Differential Revision: https://reviews.freebbsd.org/D26342


# 586ee69f 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

fs: clean up empty lines in .c and .h files


# 02511d21 10-Aug-2020 Rick Macklem <rmacklem@FreeBSD.org>

Add an argument to newnfs_connect() that indicates use TLS for the connection.

For NFSv4.0, the server creates a server->client TCP connection for callbacks.
If the client mount on the server is using TLS, enable TLS for this callback
TCP connection.
TLS connections from clients will not be supported until the kernel RPC
changes are committed.

Since this changes the internal ABI between the NFS kernel modules that
will require a version bump, delete newnfs_trimtrailing(), which is no
longer used.

Since LCL_TLSCB is not yet set, these changes should not have any semantic
affect at this time.


# 245bfd34 20-May-2020 Ryan Moeller <freqlabs@FreeBSD.org>

Deduplicate fsid comparisons

Comparing fsid_t objects requires internal knowledge of the fsid structure
and yet this is duplicated across a number of places in the code.

Simplify by creating a fsidcmp function (macro).

Reviewed by: mjg, rmacklem
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D24749


# b9cc3262 12-May-2020 Ryan Moeller <freqlabs@FreeBSD.org>

nfs: Remove APPLESTATIC macro

It is no longer useful.

Reviewed by: rmacklem
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D24811


# 32033b3d 08-May-2020 Ryan Moeller <freqlabs@FreeBSD.org>

Remove APPLEKEXT ifndefs

They are no longer useful.

Reviewed by: rmacklem
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D24752


# ae070589 17-Apr-2020 Rick Macklem <rmacklem@FreeBSD.org>

Replace all instances of the typedef mbuf_t with "struct mbuf *".

The typedef mbuf_t was used for the Mac OS/X port of the code long ago.
Since this port is no longer used and the use of mbuf_t obscures what
the code does (and is not consistent with style(9)), it is no longer needed.
This patch replaces all instances of mbuf_t with "struct mbuf *", so that
it is no longer used.

This patch should not result in any semantic change.


# 9f6624d3 11-Apr-2020 Rick Macklem <rmacklem@FreeBSD.org>

Replace mbuf macros with the code they would generate in the NFS code.

When the code was ported to Mac OS/X, mbuf handling functions were
converted to using the Mac OS/X accessor functions. For FreeBSD, they
are a simple set of macros in sys/fs/nfs/nfskpiport.h.
Since porting to Mac OS/X is no longer a consideration, replacement of
these macros with the code generated by them makes the code more
readable.
When support for external page mbufs is added as needed by the KERN_TLS,
the patch becomes simpler if done without the macros.

This patch should not result in any semantic change.


# 76fd19b0 06-Apr-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix noisy NFSv4 server printf.

Peter reported that his dmesg was getting cluttered with
nfsrv_cache_session: no session
messages when he rebooted his NFS server and they did not seem useful.
He was correct, in that these messages are "normal" and expected when
NFSv4.1 or NFSv4.2 are mounted and the server is rebooted.
This patch silences the printf() during the grace period after a reboot.
It also adds the client IP address to the printf(), so that the message
is more useful if/when it occurs. If this happens outside of the
server's grace period, it does indicate something is not working correctly.
Instead of adding yet another nd_XXX argument, the arguments for
nfsrv_cache_session() were simplified to take a "struct nfsrv_descript *".

Reported by: pen@lysator.liu.se
MFC after: 2 weeks


# 60a09a94 26-Jan-2020 Rick Macklem <rmacklem@FreeBSD.org>

Fix a crash in the NFSv4 server.

The PR reported a crash that occurred when a file was removed while
client(s) were actively doing lock operations on it.
Since nfsvno_getvp() will return NULL when the file does not exist,
the bug was obvious and easy to fix via this patch. It is a little
surprising that this wasn't found sooner, but I guess the above
case rarely occurs.

Tested by: iron.udjin@gmail.com
PR: 242768
Reported by: iron.udjin@gmail.com
MFC after: 2 weeks


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# f808cf72 13-Dec-2019 Rick Macklem <rmacklem@FreeBSD.org>

Silence some "might not be initialized" warnings for riscv64.

None of these case were actually using the variable(s) uninitialized, but
I figured that silencing the warnings via initializing them made sense.

Some of these predated r355677.


# c057a378 12-Dec-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add support for NFSv4.2 to the NFS client and server.

This patch adds support for NFSv4.2 (RFC-7862) and Extended Attributes
(RFC-8276) to the NFS client and server.
NFSv4.2 is comprised of several optional features that can be supported
in addition to NFSv4.1. This patch adds the following optional features:
- posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED)
- posix_fallocate()
- intra server file range copying via the copy_file_range(2) syscall
--> Avoiding data tranfer over the wire to/from the NFS client.
- lseek(SEEK_DATA/SEEK_HOLE)
- Extended attribute syscalls for "user" namespace attributes as defined
by RFC-8276.

Although this patch is fairly large, it should not affect support for
the other versions of NFS. However it does add two new sysctls that allow
a sysadmin to limit which minor versions of NFSv4 a server supports, allowing
a sysadmin to disable NFSv4.2.

Unfortunately, when the NFS stats structure was last revised, it was assumed
that there would be no additional operations added beyond what was
specified in RFC-7862. However RFC-8276 did add additional operations,
forcing the NFS stats structure to revised again. It now has extra unused
entries in all arrays, so that future extensions to NFSv4.2 can be
accomodated without revising this structure again.

A future commit will update nfsstat(1) to report counts for the new NFSv4.2
specific operations/procedures.

This patch affects the internal interface between the nfscommon, nfscl and
nfsd modules and, as such, they all must be upgraded simultaneously.
I will do a version bump (although arguably not needed), due to this.

This code has survived a "make universe" but has not been built with a
recent GCC. If you encounter build problems, please email me.

Relnotes: yes


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# 2e670777 04-Sep-2019 Rick Macklem <rmacklem@FreeBSD.org>

Delete the unused "nd" argument for nfsrv_checkdsattr().

The "nd" argument for nfsrv_checkdsattr() is no longer used by the function.
This patch deletes it. This allows subsequent patches to delete the "nd"
argument from nfsrv_proxyds(), since it's only use of "nd" was to pass it
to nfsrv_checkdsattr(). The same will then be true for nfsvno_getattr(),
which passes "nd" to nfsrv_proxyds().
Getting rid of the "nd" argument from nfsvno_getattr() avoids confusion
over why it might need "nd".

This patch is trivial and does not have any semantic effect.
Found by inspection while working on the NFSv4.2 server.


# ed2f1001 13-Apr-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add support for INET6 addresses to the kernel code that dumps open/lock state.

PR#223036 reported that INET6 callback addresses were not printed by
nfsdumpstate(8). This kernel patch adds INET6 addresses to the dump structure,
so that nfsdumpstate(8) can print them out, post-r346190.
The patch also includes the addition of #ifdef INET, INET6 as requested
by bz@.

PR: 223036
Reviewed by: bz, rgrimes
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19839


# fdab4d3b 18-Aug-2018 Rick Macklem <rmacklem@FreeBSD.org>

Fix LORs between vn_start_write() and vn_lock() in nfsrv_copymr().

When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probaby isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.


# 41df1b5b 08-Aug-2018 Rick Macklem <rmacklem@FreeBSD.org>

Assorted fixes to handling of LayoutRecall callbacks, mostly error handling.

After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
for Layouts to be returned during a mirror copy, fail and return
ENXIO after about 1minute.
Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
but did not make sense when done via find(1).
This patch only affects the pNFS server.


# 25705dd5 05-Aug-2018 Rick Macklem <rmacklem@FreeBSD.org>

Copy all bits of a file handle in case there is padding in the structure.

At least on x86, fhandle_t is a packed structure, so I believe an
assignment will copy all the bits. However, for some current/future
architectures, there might be padding in the structure that doesn't get
copied via an assignment.
Since NFS assumes a file handle is an opaque blob of bits that can be
compared via memcmp()/bcmp(), all the bits including any padding must be
copied.
This patch replaces the assignments with a call to a byte copy function.
Spotted during code inspection.


# 8014c971 29-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Silence newer gcc warnings.

Newer versions of gcc generate "set, but not used" warnings in the NFS server.
Add __unused macros to silence these warnings.

Requested by: mmacy


# a3e709cd 28-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Modify the NFSv4.1 server so that it allows ReclaimComplete as done by ESXi 6.7.

I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server. However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.

This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.

Reported by: andreas.nagy@frequentis.com
Tested by: andreas.nagy@frequentis.com, daniel@ftml.net (earlier version)
MFC after: 2 weeks
Relnotes: yes


# 5d54f186 16-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Modify the reasons for not issuing a delegation in the NFSv4.1 server.

The ESXi NFSv4.1 client will generate warning messages when the reason for
not issuing a delegation is two. Two refers to a resource limit and I do
not see why it would be considered invalid. However it probably was not the
best choice of reason for not issuing a delegation.
This patch changes the reasons used to ones that the ESXi client doesn't
complain about. This change does not affect the FreeBSD client and does
not appear to affect behaviour of the Linux NFSv4.1 client.
RFC5661 defines these "reasons" but does not give any guidance w.r.t. which
ones are more appropriate to return to a client.

Tested by: andreas.nagy@frequentis.com
PR: 226650
MFC after: 2 weeks


# de9a1a70 09-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add support for a "forced" pnfsdskill to the pNFS server kernel code.

The pnfsdskill(8) command will normally fail if there is no valid mirror
for the DS to be disabled. However, a system administrator may need to
disable a DS which does not have a valid mirror so that the nfsd threads
can be terminated. This patch adds the kernel code needed by pnfsdskill(8)
to implement this "forced" case of disabling a DS.
This patch only affects the pNFS server.


# acc6e58d 08-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Fix the kernel part of pnfsdscopymr() to handle holes in the file being copied.

If a mirrored DS is being recovered that has a lot of large sparse files,
pnfsdscopymr(8) would use a lot of space on the recovered mirror since it
would write the "holes" in the file being mirrored.
This patch adds code to check for a "hole" and skip doing the write.
The check is done on a "per PNFSDS_COPYSIZ size block", which is currently 64K.
I think that most file server file systems will be using a blocksize at
least this large. If the file server is using a smaller blocksize and
smaller holes need to be preserved, PNFSDS_COPYSIZ could be decreased.
The block of 0s is malloc()d, since pnfsdcopymr(8) should be an infrequent
occurrence.


# 5b500ea9 06-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Change the pNFS server so that it does not disable a mirrored DS for
an NFSERR_STALE error reported via a LayoutReturn.

The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This change only affects the pNFS server and only when a client does a
LayoutReturn with the NFSERR_STALE error report.


# ff3b992f 04-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Fix the pNFS server so that it handles the "#mds_path" check for mirrors.

The recently added feature of the pNFS server will set an fsid for the
MDS file system to define the file system a DS should store files for.
For a case where a DS handling all file systems has failed, it was possible
for the code to check for a mirror with a specified fs, even though
nfsdev_mdsisset was 0, possibly causing a false successful check for a mirror.
This patch adds a check for nfsdev_mdsisset != 0 to avoid this.
It only affects the pNFS server for a rare case. Found via code inspection.


# 2f32675c 02-Jul-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add an optional feature to the pNFS server.

Without this patch, the pNFS server distributes the data storage files across
all of the specified DSs.
A tester noted that it would be nice if a system administrator could control
which DSs are used to store the file data for a given exported MDS file system.
This patch adds the kernel support to do this. It also makes a slight semantic
change to nfsv4_findmirror(), since some uses of it no longer require that
the DS being searched for have a current mirror.
A patch that will be committed in a few minutes will modify the nfsd daemon
to support this feature.
The patch should only affect sites using the pNFS server (specified via the
"-p" command line option for nfsd.

Suggested by: james.rose@framestore.com


# c16f407e 21-Jun-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add a counter to limit the number of disabled DSs for a mirrored pNFS MDS.

This patch adds a counter that limits the number of disabled mirrored DSs
to mirror level - 1. It also makes a small change that keeps a Write that
has failed with EACCES when attempted by a client to a DS from disabling
the DS.
This patch only affects the pNFS server.


# 90d2dfab 12-Jun-2018 Rick Macklem <rmacklem@FreeBSD.org>

Merge the pNFS server code from projects/pnfs-planb-server into head.

This code merge adds a pNFS service to the NFSv4.1 server. Although it is
a large commit it should not affect behaviour for a non-pNFS NFS server.
Some documentation on how this works can be found at:
http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
and will hopefully be turned into a proper document soon.
This is a merge of the kernel code. Userland and man page changes will
come soon, once the dust settles on this merge.
It has passed a "make universe", so I hope it will not cause build problems.
It also adds NFSv4.1 server support for the "current stateid".

Here is a brief overview of the pNFS service:
A pNFS service separates the Read/Write oeprations from all the other NFSv4.1
Metadata operations. It is hoped that this separation allows a pNFS service
to be configured that exceeds the limits of a single NFS server for either
storage capacity and/or I/O bandwidth.
It is possible to configure mirroring within the data servers (DSs) so that
the data storage file for an MDS file will be mirrored on two or more of
the DSs.
When this is used, failure of a DS will not stop the pNFS service and a
failed DS can be recovered once repaired while the pNFS service continues
to operate. Although two way mirroring would be the norm, it is possible
to set a mirroring level of up to four or the number of DSs, whichever is
less.
The Metadata server will always be a single point of failure,
just as a single NFS server is.

A Plan B pNFS service consists of a single MetaData Server (MDS) and K
Data Servers (DS), all of which are recent FreeBSD systems.
Clients will mount the MDS as they would a single NFS server.
When files are created, the MDS creates a file tree identical to what a
single NFS server creates, except that all the regular (VREG) files will
be empty. As such, if you look at the exported tree on the MDS directly
on the MDS server (not via an NFS mount), the files will all be of size 0.
Each of these files will also have two extended attributes in the system
attribute name space:
pnfsd.dsfile - This extended attrbute stores the information that
the MDS needs to find the data storage file(s) on DS(s) for this file.
pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime
and Change attributes for the file, so that the MDS doesn't need to
acquire the attributes from the DS for every Getattr operation.
For each regular (VREG) file, the MDS creates a data storage file on one
(or more if mirroring is enabled) of the DSs in one of the "dsNN"
subdirectories. The name of this file is the file handle
of the file on the MDS in hexadecimal so that the name is unique.
The DSs use subdirectories named "ds0" to "dsN" so that no one directory
gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize
on the MDS, with the default being 20.
For production servers that will store a lot of files, this value should
probably be much larger.
It can be increased when the "nfsd" daemon is not running on the MDS,
once the "dsK" directories are created.

For pNFS aware NFSv4.1 clients, the FreeBSD server will return two pieces
of information to the client that allows it to do I/O directly to the DS.
DeviceInfo - This is relatively static information that defines what a DS
is. The critical bits of information returned by the FreeBSD
server is the IP address of the DS and, for the Flexible
File layout, that NFSv4.1 is to be used and that it is
"tightly coupled".
There is a "deviceid" which identifies the DeviceInfo.
Layout - This is per file and can be recalled by the server when it
is no longer valid. For the FreeBSD server, there is support
for two types of layout, call File and Flexible File layout.
Both allow the client to do I/O on the DS via NFSv4.1 I/O
operations. The Flexible File layout is a more recent variant
that allows specification of mirrors, where the client is
expected to do writes to all mirrors to maintain them in a
consistent state. The Flexible File layout also allows the
client to report I/O errors for a DS back to the MDS.
The Flexible File layout supports two variants referred to as
"tightly coupled" vs "loosely coupled". The FreeBSD server always
uses the "tightly coupled" variant where the client uses the
same credentials to do I/O on the DS as it would on the MDS.
For the "loosely coupled" variant, the layout specifies a
synthetic user/group that the client uses to do I/O on the DS.
The FreeBSD server does not do striping and always returns
layouts for the entire file. The critical information in a layout
is Read vs Read/Writea and DeviceID(s) that identify which
DS(s) the data is stored on.

At this time, the MDS generates File Layout layouts to NFSv4.1 clients
that know how to do pNFS for the non-mirrored DS case unless the sysctl
vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File
layouts are generated.
The mirrored DS configuration always generates Flexible File layouts.
For NFS clients that do not support NFSv4.1 pNFS, all I/O operations
are done against the MDS which acts as a proxy for the appropriate DS(s).
When the MDS receives an I/O RPC, it will do the RPC on the DS as a proxy.
If the DS is on the same machine, the MDS/DS will do the RPC on the DS as
a proxy and so on, until the machine runs out of some resource, such as
session slots or mbufs.
As such, DSs must be separate systems from the MDS.

Tested by: james.rose@framestore.com
Relnotes: yes


# 9442a64e 01-Jun-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add the BindConnectiontoSession operation to the NFSv4.1 server.

Under some fairly unusual circumstances, the Linux NFSv4.1 client is
doing a BindConnectiontoSession operation for TCP connections.
It is also used by the ESXi6.5 NFSv4.1 client.
This patch adds this operation to the NFSv4.1 server.

Reported by: andreas.nagy@frequentis.com
Tested by: andreas.nagy@frequentis.com
MFC after: 2 weeks


# 440e2f9e 30-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Strengthen locking for the NFSv4.1 server DestroySession operation.

If a client did a DestroySession on a session while it was still in use,
the server might try to use the session structure after it is free'd.
I think the client has violated RFC5661 if it does this, but this patch
makes DestroySession block all other nfsd threads so no thread could
be using the session when it is free'd. After the DestroySession, nfsd
threads will not be able to find the session. The patch also adds a check
for nd_sessionid being set, although if that was not the case it would have
been all 0s and unlikely to have a false match.
This might fix the crashes described in PR#228497 for the FreeNAS server.

PR: 228497
MFC after: 1 week


# 04b19055 17-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add a missing nfsrv_freesession() call for an unlikely failure case.

Since NFSv4.1 clients normally create a single session which supports
both fore and back channels, it is unlikely that a callback will fail
due to a lack of a back channel.
However, if this failure occurred, the session wasn't being dereferenced
and would never be free'd.
Found by inspection during pNFS server development.

Tested by: andreas.nagy@frequentis.com
MFC after: 2 months


# 0ebe2634 15-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

End grace for the NFSv4 server if all mounts do ReclaimComplete.

The NFSv4 protocol requires that the server only allow reclaim of state
and not issue any new open/lock state for a grace period after booting.
The NFSv4.0 protocol required this grace period to be greater than the
lease duration (over 2minutes). For NFSv4.1, the client tells the server
that it has done reclaiming state by doing a ReclaimComplete operation.
If all NFSv4 clients are NFSv4.1, the grace period can end once all the
clients have done ReclaimComplete, shortening the time period considerably.
This patch does this. If there are any NFSv4.0 mounts, the grace period
will still be over 2minutes.
This change is only an optimization and does not affect correct operation.

Tested by: andreas.nagy@frequentis.com
MFC after: 2 months


# 0f13d146 12-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Fix a slow leak of session structures in the NFSv4.1 server.

For a fairly rare case of a client doing an ExchangeID after a hard reboot,
the old confirmed clientid still exists, but some clients use a new
co_verifier. For this case, the server was not freeing up the sessions on
the old confirmed clientid.
This patch fixes this case. It also adds two LIST_INIT() macros, which are
actually no-ops, since the structure is malloc()d with M_ZERO so the pointer
is already set to NULL.
It should have minimal impact, since the only way I could exercise this
code path was by doing a hard power cycle (pulling the plus) on a machine
running Linux with a NFSv4.1 mount on the server.
Originally spotted during testing of the ESXi 6.5 client.

Tested by: andreas.nagy@frequentis.com
MFC after: 2 months


# bb343696 12-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

The NFSv4.1 server should return NFSERR_BACKCHANBUSY instead of NFS_OK.

When an NFSv4.1 session is busy due to a callback being in progress,
nfsrv_freesession() should return NFSERR_BACKCHANBUSY instead of NFS_OK.
The only effect this has is that the DestroySession operation will report
the failure for this case and this probably has little or no effect on a
client. Spotted by inspection and no failures related to this have been
reported.

MFC after: 2 months


# 5d4835e4 11-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add support for the TestStateID operation to the NFSv4.1 server.

The Linux client now uses the TestStateID operation, so this patch adds
support for it to the NFSv4.1 server. The FreeBSD client never uses this
operation, so it should not be affected.

MFC after: 2 months


# 7427a9f1 02-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Revert r333183, since I am not sure that just initializing the
list is the correct thing to do and that is already done without
this commit.


# 858bb2fc 02-May-2018 Rick Macklem <rmacklem@FreeBSD.org>

Add two missing LIST_INIT()s.

This patch adds two missing LIST_INIT()s. Found by inspection.
In practice, these are currently no-ops, since the structure they are
in is malloc'd with M_ZERO and all LIST_INIT does is set the pointer
in the list head to NULL. (In other words, the M_ZERO has already
correctly initialized it.)

MFC after: 2 months


# b97b91b5 25-Jan-2018 Conrad Meyer <cem@FreeBSD.org>

nfs: Remove NFSSOCKADDRALLOC, NFSSOCKADDRFREE macros

They were just thin wrappers over malloc(9) w/ M_ZERO and free(9).

Discussed with: rmacklem, markj
Sponsored by: Dell EMC Isilon


# 222daa42 25-Jan-2018 Conrad Meyer <cem@FreeBSD.org>

style: Remove remaining deprecated MALLOC/FREE macros

Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions). Something like:

* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(

No functional change.

For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().

Reported by: jhb
Reviewed by: cy, imp, markj, rmacklem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14035


# becffb36 18-Jan-2018 Emmanuel Vadot <manu@FreeBSD.org>

nfs: Do not printf each time a lock structure is freed during module unload

There can be a lot of those structures and printing a line each time we free
one on module unload.

MFC after: 3 days


# 151ba793 24-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Do pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385


# b07bc39a 04-Dec-2017 Rick Macklem <rmacklem@FreeBSD.org>

Avoid the overhead of acquiring a lock in nfsrv_checkgetattr() when
there are no write delegations issued.

manu@ reported on the freebsd-current@ mailing list that there was
a significant performance hit in nfsrv_checkgetattr() caused by
the acquisition/release of a state lock, even when there were no
write delegations issued.
This patch add a count of outstanding issued write delegations to the
NFSv4 server. This count allows nfsrv_checkgetattr() to return without
acquiring any lock when the count is 0, avoiding the performance hit
for the case where no write delegations are issued.

Reported by: manu
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13327


# d63027b6 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/fs: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 57ef3db3 15-Oct-2017 Rick Macklem <rmacklem@FreeBSD.org>

Fix the client IP address reported by nfsdumpstate for 64bit arch and NFSv4.1.

The client IP address was not being reported for some NFSv4 mounts by
nfsdumpstate. Upon investigation, two problems were found for mounts
using IPv4. One was that the code (originally written and tested on i386)
assumed that a "u_long" was a "uint32_t" and would exactly store an
IPv4 host address. Not correct for 64bit arches.
Also, for NFSv4.1 mounts, the field was not being filled in. This was
basically correct, because NFSv4.1 does not use a callback address.
However, it meant that nfsdumpstate could not report the client IP addr.
This patch should fix both of these issues.
For IPv6, the address will still not be reported. The original NFSv4 RFC
only specified IPv4 callback addresses. I think this has changed and, if so,
a future commit to fix reporting of IPv6 addresses will be needed.

Reported by: manu
PR: 223036
MFC after: 2 weeks


# 858f6fe3 24-Apr-2017 Rick Macklem <rmacklem@FreeBSD.org>

Allow use of a write open stateid for reading in the NFSv4 server.

The NFSv4 RFCs give a server the option of allowing the use of an open
stateid for write access to be used for a Read operation.
This patch enables this by default and adds a sysctl to disable it,
for anyone who does not want this capability.
Allowing this is particularily useful for a pNFS Data Server (DS), since
they are not permitted to allow the use of special stateids.
Discovered during recent testing of the pNFS server under development.

MFC after: 2 weeks


# b4b4b530 28-Jan-2017 Baptiste Daroussin <bapt@FreeBSD.org>

Revert crap accidentally committed


# 814aaaa7 28-Jan-2017 Baptiste Daroussin <bapt@FreeBSD.org>

Revert r312923 a better approach will be taken later


# a5d19b81 05-Dec-2016 Rick Macklem <rmacklem@FreeBSD.org>

Fix the NFSv4.1 server for Open reclaim after a reboot.

The NFSv4.1 server failed to update the nfs-stablerestart file for
a client when the client was issued its first Open. As such, recovery
of Opens after a server reboot failed with NFSERR_NOGRACE.
This patch fixes this.
It also changes the code so that it malloc()'s the 1024 byte array
instead of allocating it on the kernel stack for both NFSv4.0 and NFSv4.1.
Note that this bug only affected NFSv4.1 and only when clients attempted
to reclaim Opens after a server reboot.

MFC after: 2 weeks


# dcb19c38 20-Oct-2016 Rick Macklem <rmacklem@FreeBSD.org>

A problem w.r.t. interoperation between the FreeBSD NFSv4.1 server with
delegations enabled and the Linux NFSv4.1 client was reported in
reviews.freebsd.org/D7891.
I believe that the FreeBSD server behaviour conforms to the RFC and that
the Linux client has a bug. Therefore, I do not think the proposed patch
is appropriate. When nfsrv_writedelegifpos is non-zero, the FreeBSD
server will issue a write delegation for a read open if possible.
The Linux client then erroneously assumes that the credentials used for
the read open can write the file.
This patch reverses the default value for nfsrv_writedelegifpos to 0 so
that the default behaviour is Linux compatible and adds a sysctl that can
be used to set nfsrv_writedelegifpos.

This change should only affect users that are mounting a FreeBSD server
with delegations enabled (they are not enabled by default) with a Linux
NFSv4.1 client mount.

Reported by: fatih.acar@gandi.net
Tested by: fatih.acar@gandi.net
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D7891


# 1b819cf2 12-Aug-2016 Rick Macklem <rmacklem@FreeBSD.org>

Update the nfsstats structure to include the changes needed by
the patch in D1626 plus changes so that it includes counts for
NFSv4.1 (and the draft of NFSv4.2).
Also, make all the counts uint64_t and add a vers field at the
beginning, so that future revisions can easily be implemented.
There is code in place to handle the old vesion of the nfsstats
structure for backwards binary compatibility.

Subsequent commits will update nfsstat(8) to use the new fields.

Submitted by: will (earlier version)
Reviewed by: ken
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D1626


# a96c9b30 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

NFS: spelling fixes on comments.

No funcional change.


# 0533d726 22-Apr-2016 Rick Macklem <rmacklem@FreeBSD.org>

Fix a LOR in the NFSv4.1 server.

The ordering of acquisition of the state and session mutexes was
reversed in two cases executed when an NFSv4.1 client created/freed
a session. Since clients will typically do this only when mounting
and dismounting, the likelyhood of causing a deadlock was low but possible.
This can only occur for NFSv4.1 mounts, since the others do not
use sessions.
This was detected while testing the pNFS server/client where the
client crashed during dismounting.
The patch also reorders the unlocks, although that isn't necessary
for correct operation.

MFC after: 2 weeks


# a0962bf8 21-Nov-2015 Rick Macklem <rmacklem@FreeBSD.org>

When the nfsd threads are terminated, the NFSv4 server state
(opens, locks, etc) is retained, which I believe is correct behaviour.
However, for NFSv4.1, the server also retained a reference to the xprt
(RPC transport socket structure) for the backchannel. This caused
svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
a socket upcall to occur after the mutexes in the svcpool were destroyed,
causing a crash.
This patch fixes the code so that the backchannel xprt structure is
dereferenced just before svcpool_destroy() is called, so the code
does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.

Tested by: g_amanakis@yahoo.com
PR: 204340
MFC after: 2 weeks


# 29dc40b6 14-Aug-2015 Rick Macklem <rmacklem@FreeBSD.org>

For the case where an NFSv4.1 ExchangeID operation has the client identifier
that already has a confirmed ClientID, the nfsrv_setclient() function would
not fill in the clientidp being returned. As such, the value of ClientID
returned would be whatever garbage was on the stack.
An NFSv4.1 client would not normally do this, but it appears that it can
happen for certain Linux clients. When it happens, the client persistently
retries the ExchangeID and Create_session after Create_session fails when
it uses the bogus clientid. With this patch, the correct clientid is replied.
This problem was identified in a packet trace supplied by
Ahmed Kamal via email.

Reported by: email.ahmedkamal@googlemail.com
MFC after: 2 weeks


# 25f37276 29-Jul-2015 Rick Macklem <rmacklem@FreeBSD.org>

This patch fixes a problem where, if the NFSv4 server has a previous
unconfirmed clientid structure for the same client on the last hash list,
this old entry would not be removed/deleted. I do not think this bug would have
caused serious problems, since the new entry would have been before the old one
on the list. This old entry would have eventually been scavenged/removed.
Detected while reading the code looking for another bug.

MFC after: 3 days


# 1f54e596 27-May-2015 Rick Macklem <rmacklem@FreeBSD.org>

Make the size of the hash tables used by the NFSv4 server tunable.
No appreciable change in performance was observed after increasing
the sizes of these tables and then testing with a single client.
However, there was an email that indicated high CPU overheads for
a heavily loaded NFSv4 and it is hoped that increasing the sizes
of the hash tables via these tunables might help.
The tables remain the same size by default.

Differential Revision: https://reviews.freebsd.org/D2596
MFC after: 2 weeks


# 52f1bb38 24-Dec-2014 Rick Macklem <rmacklem@FreeBSD.org>

A deadlock in the NFSv4 server with vfs.nfsd.enable_locallocks=1
was reported via email. This was caused by a LOR between the
sleep lock used to serialize the local locking (nfsrv_locklf())
and locking the vnode. I believe this patch fixes the problem
by delaying relocking of the vnode until the sleep lock is
unlocked (nfsrv_unlocklf()). To avoid nfsvno_advlock() having the side
effect of unlocking the vnode, unlocking the vnode was moved to before
the functions that call nfsvno_advlock().
It shouldn't affect the execution of the default case where
vfs.nfsd.enable_locallocks=0.

Reported by: loic.blot@unix-experience.fr
Discussed with: kib
MFC after: 1 week


# d8a5961f 02-Oct-2014 Marcelo Araujo <araujo@FreeBSD.org>

Fix failures and warnings reported by newpynfs20090424 test tool.
This fix addresses only issues with the pynfs reports, none of these
issues are know to create problems for extant real clients.

Submitted by: Bart Hsiao <bart.hsiao@gmail.com>
Reworked by: myself
Reviewed by: rmacklem
Approved by: rmacklem
Sponsored by: QNAP Systems Inc.


# c59e4cc3 01-Jul-2014 Rick Macklem <rmacklem@FreeBSD.org>

Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after: 1 month


# ab7f2410 23-Apr-2014 Rick Macklem <rmacklem@FreeBSD.org>

Remove an unnecessary level of indirection for an argument.
This simplifies the code and should avoid the clang sparc
port from generating an abort() call.

Requested by: rdivacky
Submitted by: jhb
MFC after: 2 weeks


# 43a213bb 24-Dec-2013 Rick Macklem <rmacklem@FreeBSD.org>

The NFSv4 server would call VOP_SETATTR() with a shared locked vnode
when a Getattr for a file is done by a client other than the one that
holds the file's delegation. This would only happen when delegations
are enabled and the problem is fixed by this patch.

MFC after: 1 week


# a89a2c8b 25-Jan-2013 John Baldwin <jhb@FreeBSD.org>

Further cleanups to use of timestamps in NFS:
- Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds
portion of wall-time stamps to manage timeouts on events.
- Remove unused nd_starttime from the per-request structure in the new
NFS server.
- Use nanotime() for the modification time on a delegation to get as
precise a time as possible.
- Use time_second instead of extracting the second from a call to
getmicrotime().

Submitted by: bde (3)
Reviewed by: bde, rmacklem
MFC after: 2 weeks


# 1f60bfd8 08-Dec-2012 Rick Macklem <rmacklem@FreeBSD.org>

Move the NFSv4.1 client patches over from projects/nfsv4.1-client
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 2108487e 12-May-2012 Rick Macklem <rmacklem@FreeBSD.org>

Fix two cases in the new NFS server where a tsleep() is
used, when the code should actually protect the tested
variable with a mutex. Since the tsleep()s had a 10sec
timeout, the race would have only delayed the allocation
of a new clientid for a client. The sleeps will also
rarely occur, since having a callback in progress when
a client acquires a new clientid, is unlikely.
in practice, since having a callback in progress when
a fresh clientid is being acquired by a client is unlikely.

MFC after: 1 month


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 5b79362b 13-Jan-2012 Rick Macklem <rmacklem@FreeBSD.org>

Tai Horgan reported via email that there were two places in
the new NFSv4 server where the code follows the wrong list.
Fortunately, for these fairly rare cases, the lc_stateid[]
lists are normally empty. This patch fixes the code to
follow the correct list.

Reported by: tai.horgan at isilon.com
Discussed with: zack
MFC after: 2 weeks


# a9285ae5 16-Jul-2011 Zack Kirsch <zack@FreeBSD.org>

Add DEXITCODE plumbing to NFS.

Isilon has the concept of an in-memory exit-code ring that saves the last exit
code of a function and allows for stack tracing. This is very helpful when
debugging tough issues.

This patch is essentially a no-op for BSD at this point, until we upstream
the dexitcode logic itself. The patch adds DEXITCODE calls to every NFS
function that returns an errno error code. A number of code paths were also
reorganized to have single exit paths, to reduce code duplication.

Submitted by: David Kwan <dkwan@isilon.com>
Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


# 68347a92 16-Jul-2011 Zack Kirsch <zack@FreeBSD.org>

Simple find/replace of VOP_ISLOCKED -> NFSVOPISLOCKED. This is done so that NFSVOPISLOCKED can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


# a9989634 16-Jul-2011 Zack Kirsch <zack@FreeBSD.org>

Simple find/replace of VOP_UNLOCK -> NFSVOPUNLOCK. This is done so that NFSVOPUNLOCK can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


# 98f234f3 16-Jul-2011 Zack Kirsch <zack@FreeBSD.org>

Simple find/replace of vn_lock -> NFSVOPLOCK. This is done so that NFSVOPLOCK can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


# ff29f3b2 27-May-2011 Rick Macklem <rmacklem@FreeBSD.org>

Fix the new NFS client so that it handles NFSv4 state
correctly during a forced dismount. This required that
the exclusive and shared (refcnt) sleep lock functions check
for MNTK_UMOUNTF before sleeping, so that they won't block
while nfscl_umount() is getting rid of the state. As
such, a "struct mount *" argument was added to the locking
functions. I believe the only remaining case where a forced
dismount can get hung in the kernel is when a thread is
already attempting to do a TCP connect to a dead server
when the krpc client structure called nr_client is NULL.
This will only happen just after a "mount -u" with options
that force a new TCP connection is done, so it shouldn't
be a problem in practice.

MFC after: 2 weeks


# 806e2e4b 10-Apr-2011 Rick Macklem <rmacklem@FreeBSD.org>

Add some cleanup code to the module unload operation for
the experimental NFS server, so that it doesn't leak memory
when unloaded. However, unloading the NFSv4 server is not
recommended, since all NFSv4 state will be lost by the unload
and clients will have to recover the state after a server
reload/restart as if the server crashed/rebooted.

MFC after: 2 weeks


# 5f73287a 14-Jan-2011 Rick Macklem <rmacklem@FreeBSD.org>

Modify the experimental NFSv4 server so that it posts a SIGUSR2
signal to the master nfsd daemon whenever the stable restart
file has been modified. This will allow the master nfsd daemon
to maintain an up to date backup copy of the file. This is
enabled via the nfssvc() syscall, so that older nfsd daemons
will not be signaled.

Reviewed by: jhb
MFC after: 1 week


# 770b49a3 12-Jan-2011 Zack Kirsch <zack@FreeBSD.org>

In the experimental NFS server, when converting an open-owner to a lock-owner,
start at sequence id 1 instead of 0, to match up with both Solaris and Linux.

Reviewed by: rmacklem
Approved by: zml (mentor)


# fbf0af3f 06-Jan-2011 Rick Macklem <rmacklem@FreeBSD.org>

Delete the NFS_STARTWRITE() and NFS_ENDWRITE() macros that
obscured vn_start_write() and vn_finished_write() for the
old OpenBSD port, since most uses have been replaced by the
correct calls.

MFC after: 12 days


# 629fa50e 02-Jan-2011 Rick Macklem <rmacklem@FreeBSD.org>

Add checks for VI_DOOMED and vn_lock() failures to the
experimental NFS server, to handle the case where an
exported file system is forced dismounted while an RPC
is in progress. Further commits will fix the cases where
a mount point is used when the associated vnode isn't locked.

Reviewed by: kib
MFC after: 2 weeks


# 5a12538b 01-Jan-2011 Rick Macklem <rmacklem@FreeBSD.org>

Add support for shared vnode locks for the Read operation
in the experimental NFSv4 server.

Reviewed by: kib
MFC after: 2 weeks


# d6ec8427 17-Dec-2010 Rick Macklem <rmacklem@FreeBSD.org>

Fix two vnode locking problems in nfsd_recalldelegation() in the
experimental NFSv4 server. The first was a bogus use of VOP_ISLOCKED()
in a KASSERT() and the second was the need to lock the vnode for the
nfsrv_checkremove() call. Also, delete a "__unused" that was bogus,
since the argument is used.

Reviewed by: zack.kirsch at isilon.com
MFC after: 2 weeks


# b4a8d952 09-Dec-2010 Rick Macklem <rmacklem@FreeBSD.org>

Disable attempts to establish a callback connection from the
experimental NFSv4 server to a NFSv4 client when delegations are not
being issued, even if the client advertises a callback path.
This avoids a problem where a Linux client advertises a
callback path that doesn't work, due to a firewall, and then
times out an Open attempt before the FreeBSD server gives up
its callback connection attempt. (Suggested by
drb at karlov.mff.cuni.cz to fix the Linux client problem that
he reported on the fs-stable mailing list.)
The server should probably have
a 1sec timeout on callback connection attempts when there are
no delegations issued to the client, but that patch will require
changes to the krpc and this serves as a work around until then.

Tested by: drb at karlov.mff.cuni.cz
MFC after: 5 days


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# db0a33d2 11-Oct-2010 Rick Macklem <rmacklem@FreeBSD.org>

Try and make the nfsrv_localunlock() function in the experimental
NFSv4 server more readable. Mostly changes to comments, but a
case of >= is changed to >, since == can never happen. Also, I've
added a couple of KASSERT()s and a slight optimization, since
once the "else if" case happens, subsequent locks in the list can't
have any effect. None of these changes fixes any known bug.

MFC after: 2 weeks


# a212c01a 18-Sep-2010 Rick Macklem <rmacklem@FreeBSD.org>

Fix nfsrv_freeallnfslocks() in the experimental NFSv4 server so that
it frees local locks correctly upon close. In order for
nfsrv_localunlock() to work correctly, the lock can no longer be in
the lockowner's stateid list. As such, nfsrv_freenfslock() has to
be called before nfsrv_localunlock(), to get rid of the lock structure
on the lockowner's stateid list. This only affected operation when
local locks (vfs.newnfs.enable_locallocks=1) are enabled, which is
not the default at this time.

MFC after: 1 week


# 2c6d0e01 10-Sep-2010 Rick Macklem <rmacklem@FreeBSD.org>

This patch applies one of the two fixes suggested by
zack.kirsch at isilon.com for a race between nfsrv_freeopen()
and nfsrv_getlockfile() in the experimental NFS server that
he found during testing. Although nfsrv_freeopen() holds a
sleep lock on the lock file structure when called with
cansleep != 0, nfsrv_getlockfile() could still search the
list, once it acquired the NFSLOCKSTATE() mutex. I believe
that acquiring the mutex in nfsrv_freeopen() fixes the race.

MFC after: 2 weeks


# b5cb66df 28-Aug-2010 Rick Macklem <rmacklem@FreeBSD.org>

Add acquisition of a reference count on nfsv4root_lock to the
nfsd_recalldelegation() function, since this function is called
by nfsd threads when they are handling NFSv2 or NFSv3 RPCs, where
no reference count would have been acquired.

MFC after: 2 weeks


# 2ec3f925 28-Aug-2010 Rick Macklem <rmacklem@FreeBSD.org>

The timer routine in the experimental NFS server did not acquire
the correct mutex when checking nfsv4root_lock. Although this
could be fixed by adding mutex lock/unlock calls, zack.kirsch at
isilon.com suggested a better fix that uses a non-blocking
acquisition of a reference count on nfsv4root_lock. This fix
allows the weird NFSLOCKSTATE(); NFSUNLOCKSTATE(); synchronization
to be deleted. This patch applies this fix.

Tested by: zack.kirsch at isilon.com
MFC after: 2 weeks


# 2cf552b1 16-Jul-2010 Rick Macklem <rmacklem@FreeBSD.org>

Patch the experimental NFSv4 server so that it acquires a reference
count on nfsv4rootfs_lock when dumping state, since these functions
are not called by nfsd threads. Without this reference count, it
is possible for an nfsd thread to acquire an exclusive lock on
nfsv4rootfs_lock while the dump is in progress and then change the
lists, potentially causing a crash.

Reported by: zack.kirsch at isilon.com
MFC after: 2 weeks


# 866e6c5a 15-Jul-2010 Rick Macklem <rmacklem@FreeBSD.org>

Delete comments related to soft clock interrupts that don't apply
to the FreeBSD port of the experimental NFSv4 server.

Submitted by: zack.kirsch at isilon.com
MFC after: 2 weeks


# 63f6e5bf 14-Jul-2010 Rick Macklem <rmacklem@FreeBSD.org>

This patch fixes a bug in the experimental NFSv4 server where it
released a reference count on nfsv4rootfs_lock erroneously when
administrative revocation of state was done.

Submitted by: zack.kirsch at isilon.com
MFC after: 2 weeks


# 95b1c51b 13-Jul-2010 Rick Macklem <rmacklem@FreeBSD.org>

Fix a bogus comment that mentions lru lists that don't exist.

Reported by: zack.kirsch at isilon.com
MFC after: 2 weeks


# 227b9ebe 30-Apr-2010 Rick Macklem <rmacklem@FreeBSD.org>

MFC: r207170
An NFSv4 server will reply NFSERR_GRACE for non-recovery RPCs
during the grace period after startup. This grace period must
be at least the lease duration, which is typically 1-2 minutes.
It seems prudent for the experimental NFS client to wait a few
seconds before retrying such an RPC, so that the server isn't
flooded with non-recovery RPCs during recovery. This patch adds
an argument to nfs_catnap() to implement a 5 second delay
for this case.


# 23f929df 24-Apr-2010 Rick Macklem <rmacklem@FreeBSD.org>

An NFSv4 server will reply NFSERR_GRACE for non-recovery RPCs
during the grace period after startup. This grace period must
be at least the lease duration, which is typically 1-2 minutes.
It seems prudent for the experimental NFS client to wait a few
seconds before retrying such an RPC, so that the server isn't
flooded with non-recovery RPCs during recovery. This patch adds
an argument to nfs_catnap() to implement a 5 second delay
for this case.

MFC after: 1 week


# 066adacf 13-Apr-2010 Rick Macklem <rmacklem@FreeBSD.org>

MFC: r205941
This patch should fix handling of byte range locks locally
on the server for the experimental nfs server. When enabled
by setting vfs.newnfs.locallocks_enable to non-zero, the
experimental nfs server will now acquire byte range locks
on the file on behalf of NFSv4 clients, such that lock
conflicts between the NFSv4 clients and processes running
locally on the server, will be recognized and handled correctly.


# a43fcbe3 30-Mar-2010 Rick Macklem <rmacklem@FreeBSD.org>

This patch should fix handling of byte range locks locally
on the server for the experimental nfs server. When enabled
by setting vfs.newnfs.locallocks_enable to non-zero, the
experimental nfs server will now acquire byte range locks
on the file on behalf of NFSv4 clients, such that lock
conflicts between the NFSv4 clients and processes running
locally on the server, will be recognized and handled correctly.

MFC after: 2 weeks


# 3ed680d8 19-Feb-2010 Rick Macklem <rmacklem@FreeBSD.org>

MFC: r203849
Change the default value for vfs.newnfs.enable_locallocks to 0 for
the experimental NFS server, since local locking is known to be
broken and the patch to fix it is still a work in progress.


# 9c360f22 13-Feb-2010 Rick Macklem <rmacklem@FreeBSD.org>

Change the default value for vfs.newnfs.enable_locallocks to 0 for
the experimental NFS server, since local locking is known to be
broken and the patch to fix it is still a work in progress.

MFC after: 5 days


# 8c271f67 17-Jan-2010 Rick Macklem <rmacklem@FreeBSD.org>

MFC: r201442
The test for "same client" for the experimental nfs server over NFSv4
was broken w.r.t. byte range lock conflicts when it was the same client
and the request used the open_to_lock_owner4 case, since lckstp->ls_clp
was not set. This patch fixes it by using "clp" instead of "lckstp->ls_clp".


# 39682934 03-Jan-2010 Rick Macklem <rmacklem@FreeBSD.org>

The test for "same client" for the experimental nfs server over NFSv4
was broken w.r.t. byte range lock conflicts when it was the same client
and the request used the open_to_lock_owner4 case, since lckstp->ls_clp
was not set. This patch fixes it by using "clp" instead of "lckstp->ls_clp".

MFC after: 2 weeks


# 838d9858 19-Jun-2009 Brooks Davis <brooks@FreeBSD.org>

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867


# 47b7dc99 16-Jun-2009 Rick Macklem <rmacklem@FreeBSD.org>

Remove the "int *" typecast for the aresid argument to vn_rdwr()
and change the type of the argument from size_t to int. This
should avoid issues on 64bit architectures.

Suggested by: kib
Approved by: kib (mentor)


# ed41a7cc 22-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Change the printf of r192595 to identify the function,
as requested by Sam.

Approved by: kib (mentor)


# 47617400 22-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Modified the printf message of r192590 to remove the
possible DOS attack, as suggested by Sam.

Approved by: kib (mentor)


# 8757104e 22-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Change the comment at the beginning of the function to reflect the
change from panic() to printf() done by r192588.


# 199685bc 22-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Change the reboot panic that would have occurred if clientid
numbers wrapped around to a printf() warning of a possible
DOS attack, in the experimental nfsv4 server.

Approved by: kib (mentor)


# bac9ff34 21-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Fix the comment at line 3711 to be consistent with the change
applied for r192537.

Approved by: kib (mentor)


# 29e890f1 20-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Although it should never happen, all the nfsv4 server can do
when it runs out of clientids is reboot. I had replaced cpu_reboot()
with printf(), since cpu_reboot() doesn't exist for sparc64.
This change replaces the printf() with panic(), so the reboot
would occur for this highly unlikely occurrence.

Approved by: kib (mentor)


# 15e8331f 15-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Fixed the Null callback RPCs so that they work with the new krpc. This
required two changes: setting the program and version numbers before
connect and fixing the handling of the Null Rpc case in newnfs_request().

Approved by: kib (mentor)


# 9ec7b004 04-May-2009 Rick Macklem <rmacklem@FreeBSD.org>

Add the experimental nfs subtree to the kernel, that includes
support for NFSv4 as well as NFSv2 and 3.
It lives in 3 subdirs under sys/fs:
nfs - functions that are common to the client and server
nfsclient - a mutation of sys/nfsclient that call generic functions
to do RPCs and handle state. As such, it retains the
buffer cache handling characteristics and vnode semantics that
are found in sys/nfsclient, for the most part.
nfsserver - the server. It includes a DRC designed specifically for
NFSv4, that is used instead of the generic DRC in sys/rpc.
The build glue will be checked in later, so at this point, it
consists of 3 new subdirs that should not affect kernel building.

Approved by: kib (mentor)