History log of /freebsd-10-stable/sys/fs/nfsserver/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
358035 17-Feb-2020 rmacklem

MFC: r357149
Fix a crash in the NFSv4 server.

The PR reported a crash that occurred when a file was removed while
client(s) were actively doing lock operations on it.
Since nfsvno_getvp() will return NULL when the file does not exist,
the bug was obvious and easy to fix via this patch. It is a little
surprising that this wasn't found sooner, but I guess the above
case rarely occurs.

PR: 242768

347040 03-May-2019 rmacklem

MFC: r346365
Fix the NFSv4.0 server so that it does not support NFSv4.1 attributes.

During inspection of a packet trace, I noticed that an NFSv4.0 mount
reported that it supported attributes that are only defined for NFSv4.1.
In practice, this bug appears to be benign, since NFSv4.0 clients will
not use attributes that were added for NFSv4.1.
However, this was not correct and this patch fixes the NFSv4.0 server
so that it only supports attributes defined for NFSv4.0.
It also adds a definition for NFSv4.1 attributes that can only be set,
although it is only defined as 0 for now.
This is anticipation of the addition of support for the NFSv4.1 mode+mask
attribute soon.

346779 27-Apr-2019 rmacklem

MFC: r346191
Add support for INET6 addresses to the kernel code that dumps open/lock state.

PR#223036 reported that INET6 callback addresses were not printed by
nfsdumpstate(8). This kernel patch adds INET6 addresses to the dump structure,
so that nfsdumpstate(8) can print them out, post-r346190.

340184 06-Nov-2018 avg

MFC r339595: nfsrvd_readdirplus: for some errors, do not fail the entire request

Sponsored by: Panzura

338132 21-Aug-2018 rmacklem

MFC: r336839
Modify the NFSv4.1 server so that it allows ReclaimComplete as done by ESXi 6.7.

I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server. However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.

This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.

337058 01-Aug-2018 rmacklem

MFC: r336357
Modify the reasons for not issuing a delegation in the NFSv4.1 server.

The ESXi NFSv4.1 client will generate warning messages when the reason for
not issuing a delegation is two. Two refers to a resource limit and I do
not see why it would be considered invalid. However it probably was not the
best choice of reason for not issuing a delegation.
This patch changes the reasons used to ones that the ESXi client doesn't
complain about. This change does not affect the FreeBSD client and does
not appear to affect behaviour of the Linux NFSv4.1 client.
RFC5661 defines these "reasons" but does not give any guidance w.r.t. which
ones are more appropriate to return to a client.

337006 31-Jul-2018 rmacklem

MFC: r336215
Ignore the cookie verifier for NFSv4.1 when the cookie is 0.

RFC5661 states that the cookie verifier should be 0 when the cookie is 0.
However, the wording is somewhat unclear and a recent discussion on the
nfsv4@ietf.org mailing list indicated that the NFSv4 server should ignore
the cookie verifier's value when the dirctory offset cookie is 0.
This patch deletes the check for this that would return NFSERR_BAD_COOKIE
when the verifier was not 0.
This was found during testing of the ESXi client against the NFSv4.1 server.

336846 28-Jul-2018 rmacklem

MFC: r334492
Add the BindConnectiontoSession operation to the NFSv4.1 server.

Under some fairly unusual circumstances, the Linux NFSv4.1 client is
doing a BindConnectiontoSession operation for TCP connections.
It is also used by the ESXi6.5 NFSv4.1 client.
This patch adds this operation to the NFSv4.1 server.

PR: 226493

336518 19-Jul-2018 rmacklem

MFC: r333766
Add a missing nfsrv_freesession() call for an unlikely failure case.

Since NFSv4.1 clients normally create a single session which supports
both fore and back channels, it is unlikely that a callback will fail
due to a lack of a back channel.
However, if this failure occurred, the session wasn't being dereferenced
and would never be free'd.
Found by inspection during pNFS server development.

336422 17-Jul-2018 rmacklem

MFC: r333645
End grace for the NFSv4 server if all mounts do ReclaimComplete.

The NFSv4 protocol requires that the server only allow reclaim of state
and not issue any new open/lock state for a grace period after booting.
The NFSv4.0 protocol required this grace period to be greater than the
lease duration (over 2minutes). For NFSv4.1, the client tells the server
that it has done reclaiming state by doing a ReclaimComplete operation.
If all NFSv4 clients are NFSv4.1, the grace period can end once all the
clients have done ReclaimComplete, shortening the time period considerably.
This patch does this. If there are any NFSv4.0 mounts, the grace period
will still be over 2minutes.
This change is only an optimization and does not affect correct operation.

336234 12-Jul-2018 rmacklem

MFC: r333579
The NFSv4.1 server should return NFSERR_BACKCHANBUSY instead of NFS_OK.

When an NFSv4.1 session is busy due to a callback being in progress,
nfsrv_freesession() should return NFSERR_BACKCHANBUSY instead of NFS_OK.
The only effect this has is that the DestroySession operation will report
the failure for this case and this probably has little or no effect on a
client. Spotted by inspection and no failures related to this have been
reported.

336179 10-Jul-2018 rmacklem

MFC: r333508
Add support for the TestStateID operation to the NFSv4.1 server.

The Linux client now uses the TestStateID operation, so this patch adds
support for it to the NFSv4.1 server. The FreeBSD client never uses this
operation, so it should not be affected.

334741 06-Jun-2018 rmacklem

MFC: r333580
Fix a slow leak of session structures in the NFSv4.1 server.

For a fairly rare case of a client doing an ExchangeID after a hard reboot,
the old confirmed clientid still exists, but some clients use a new
co_verifier. For this case, the server was not freeing up the sessions on
the old confirmed clientid.
This patch fixes this case. It also adds two LIST_INIT() macros, which are
actually no-ops, since the structure is malloc()d with M_ZERO so the pointer
is already set to NULL.
It should have minimal impact, since the only way I could exercise this
code path was by doing a hard power cycle (pulling the plus) on a machine
running Linux with a NFSv4.1 mount on the server.
Originally spotted during testing of the ESXi 6.5 client.

PR: 228497

334699 06-Jun-2018 rmacklem

MFC: r334396
Strengthen locking for the NFSv4.1 server DestroySession operation.

If a client did a DestroySession on a session while it was still in use,
the server might try to use the session structure after it is free'd.
I think the client has violated RFC5661 if it does this, but this patch
makes DestroySession block all other nfsd threads so no thread could
be using the session when it is free'd. After the DestroySession, nfsd
threads will not be able to find the session. The patch also adds a check
for nd_sessionid being set, although if that was not the case it would have
been all 0s and unlikely to have a false match.
This might fix the crashes described in PR#228497 for the FreeNAS server.

PR: 228497

334633 04-Jun-2018 rmacklem

MFC: r333592
Fix the eir_server_scope reply argument for NFSv4.1 ExchangeID.

In the reply to an ExchangeID operation, the NFSv4.1 server returns a
"scope" value (eir_server_scope). If this value is the same, it indicates
that two servers share state, which is never the case for FreeBSD servers.
As such, the value needs to be unique and it was without this patch.
However, I just found out that it is not supposed to change when the
server reboots and without this patch, it did change.
This patch fixes eir_server_scope so that it does not change when the
server is rebooted.
The only affect not having this patch has is that Linux clients don't
reclaim opens and locks after a server reboot, which meant they lost
any byte range locks held before the server rebooted.
It only affects NFSv4.1 mounts and the FreeBSD NFSv4.1 client was not
affected by this bug.

327008 19-Dec-2017 rmacklem

MFC: r326544
Avoid the overhead of acquiring a lock in nfsrv_checkgetattr() when
there are no write delegations issued.

manu@ reported on the freebsd-current@ mailing list that there was
a significant performance hit in nfsrv_checkgetattr() caused by
the acquisition/release of a state lock, even when there were no
write delegations issued.
This patch add a count of outstanding issued write delegations to the
NFSv4 server. This count allows nfsrv_checkgetattr() to return without
acquiring any lock when the count is 0, avoiding the performance hit
for the case where no write delegations are issued.

325448 05-Nov-2017 rmacklem

MFC: r324639
Fix the client IP address reported by nfsdumpstate for 64bit arch and NFSv4.1.

The client IP address was not being reported for some NFSv4 mounts by
nfsdumpstate. Upon investigation, two problems were found for mounts
using IPv4. One was that the code (originally written and tested on i386)
assumed that a "u_long" was a "uint32_t" and would exactly store an
IPv4 host address. Not correct for 64bit arches.
Also, for NFSv4.1 mounts, the field was not being filled in. This was
basically correct, because NFSv4.1 does not use a callback address.
However, it meant that nfsdumpstate could not report the client IP addr.
This patch should fix both of these issues.
For IPv6, the address will still not be reported. The original NFSv4 RFC
only specified IPv4 callback addresses. I think this has changed and, if so,
a future commit to fix reporting of IPv6 addresses will be needed.

324544 11-Oct-2017 rmacklem

MFC: r323978
Change a panic to an error return.

There was a panic() in the NFS server's write operation that didn't
need to be a panic() and could just be an error return.
This patch makes that change.
Found by code inspection during development of the pNFS service.

322138 07-Aug-2017 mav

MFC r321794: Improve FHA locality control for NFS read/write requests.

This change adds two new tunables, allowing to control serialization for
read and write NFS requests separately. It does not change the default
behavior since there are too many factors to consider, but gives additional
space for further experiments and tuning.

The main motivation for this change is very low write speed in case of ZFS
with sync=always or when NFS clients requests sychronous operation, when
every separate request has to be written/flushed to ZIL, and requests are
processed one at a time. Setting vfs.nfsd.fha.write=0 in that case allows
to increase ZIL throughput by several times by coalescing writes and cache
flushes. There is a worry that doing it may increase data fragmentation
on disks, but I suppose it should not happen for pool with SLOG.

Sponsored by: iXsystems, Inc.

317986 08-May-2017 rmacklem

MFC: r317382
Allow use of a write open stateid for reading in the NFSv4 server.

The NFSv4 RFCs give a server the option of allowing the use of an open
stateid for write access to be used for a Read operation.
This patch enables this by default and adds a sysctl to disable it,
for anyone who does not want this capability.
Allowing this is particularily useful for a pNFS Data Server (DS), since
they are not permitted to allow the use of special stateids.
Discovered during recent testing of the pNFS server under development.

317914 07-May-2017 rmacklem

MFC: r317236
Fix the setting of atime for Linux client NFSv4 mounts.

The FreeBSD NFSv4 server did not set the attribute bit for TimeAccess in
the reply to an Open with exclusive_create, as required by the RFCs.
(This is required since the FreeBSD NFS server stores the create_verifier
in the va_atime attribute.)
As such, the Linux NFSv4 client did not set the TimeAccess (atime) in
the Setattr done in an RPC after the one with the Open/exclusive_create.
This patch fixes the server to set the TimeAccess bit in the reply.

I believe that storing the create_verifier in an extended attribute for
file systems that support extended attributes might be a good idea,
but I will wait for a discussion of this on the freebsd-fs@ email list
before considering committing a patch to do this.

314034 21-Feb-2017 avg

MFC r313735: add svcpool_close to handle killed nfsd threads

PR: 204340
Reported by: Panzura
Reviewed by: rmacklem
Approved by: rmacklem

312421 19-Jan-2017 jpaetzel

MFC 311122

Workaround NFS bug with readdirplus when there are greater than 1 billion files in a filesystem.

Reviewed by: kib

310303 19-Dec-2016 rmacklem

MFC: r309566
Fix the NFSv4.1 server for Open reclaim after a reboot.

The NFSv4.1 server failed to update the nfs-stablerestart file for
a client when the client was issued its first Open. As such, recovery
of Opens after a server reboot failed with NFSERR_NOGRACE.
This patch fixes this.
It also changes the code so that it malloc()'s the 1024 byte array
instead of allocating it on the kernel stack for both NFSv4.0 and NFSv4.1.
Note that this bug only affected NFSv4.1 and only when clients attempted
to reclaim Opens after a server reboot.

308241 03-Nov-2016 rmacklem

MFC: r307694
A problem w.r.t. interoperation between the FreeBSD NFSv4.1 server with
delegations enabled and the Linux NFSv4.1 client was reported in
reviews.freebsd.org/D7891.
I believe that the FreeBSD server behaviour conforms to the RFC and that
the Linux client has a bug. Therefore, I do not think the proposed patch
is appropriate. When nfsrv_writedelegifpos is non-zero, the FreeBSD
server will issue a write delegation for a read open if possible.
The Linux client then erroneously assumes that the credentials used for
the read open can write the file.
This patch reverses the default value for nfsrv_writedelegifpos to 0 so
that the default behaviour is Linux compatible and adds a sysctl that can
be used to set nfsrv_writedelegifpos.

This change should only affect users that are mounting a FreeBSD server
with delegations enabled (they are not enabled by default) with a Linux
NFSv4.1 client mount.

306663 03-Oct-2016 rmacklem

Revert r306659 since the userland changes won't merge and this would
break the build.

306659 03-Oct-2016 rmacklem

MFC: r304026
Update the nfsstats structure to include the changes needed by
the patch in D1626 plus changes so that it includes counts for
NFSv4.1 (and the draft of NFSv4.2).
Also, make all the counts uint64_t and add a vers field at the
beginning, so that future revisions can easily be implemented.
There is code in place to handle the old vesion of the nfsstats
structure for backwards binary compatibility.

Subsequent commits will update nfsstat(8) to use the new fields.

300778 26-May-2016 rmacklem

MFC: r299514
Fix use-after-free in NFS4 lock test service.

Trivial use-after-free where stp was freed too soon in the non-error path.
To fix, simply move its release to the end of the routine.

300379 21-May-2016 rmacklem

MFC: r299226
Don't increment srvrpccnt[] for the NFSv4.1 operations.

When support for NFSv4.1 was added to the NFS server, it broke
the server rpc count stats, since newnfsstats.srvrpccnt[] doesn't
have entries for the new NFSv4.1 operations.
Without this patch, the code was incrementing bogus entries in
newnfsstats for the new NFSv4.1 operations.
This patch is an interim fix. The nfsstats structure needs to be
updated and that will come in a future commit.

300254 20-May-2016 rmacklem

MFC: r299201
Give mountd -S priority over outstanding RPC requests when suspending the nfsd.

It was reported via email that under certain heavy RPC loads
long delays before the exports would be updated was observed
when using "mountd -S". This patch reverses the priority between
the exclusive lock request to suspend the nfsd threads and the
shared lock request for performing RPCs.
As such, when mountd attempts to suspend the nfsd threads, it
gets priority over outstanding RPC requests to do this.
I suspect that the case reported was an artificial test load,
but this patch did fix the problem for the reporter.

299223 07-May-2016 rmacklem

MFC: r298523
Allow the NFSv4 server to reply NFSERR_WRONGSEC for the SetClientID operation.

It was reported via email that a Linux client couldn't do a Kerberized
NFS mount when only "sec=krb5" was specified for the exports. The Linux
client attempted a mount via krb5i and the server replied NFSERR_SERVERFAULT.
Although NFSERR_WRONGSEC isn't listed as an error for SetClientID, I
think it is the correct reply, so this patch enables that.
I do not know if this fixes the mount attempt, but adding "krb5i" to the
list of allowed security flavours does allow the mount to work.

299222 07-May-2016 rmacklem

MFC: r298495
Fix a LOR in the NFSv4.1 server.

The ordering of acquisition of the state and session mutexes was
reversed in two cases executed when an NFSv4.1 client created/freed
a session. Since clients will typically do this only when mounting
and dismounting, the likelyhood of causing a deadlock was low but possible.
This can only occur for NFSv4.1 mounts, since the others do not
use sessions.
This was detected while testing the pNFS server/client where the
client crashed during dismounting.
The patch also reorders the unlocks, although that isn't necessary
for correct operation.

299208 07-May-2016 rmacklem

MFC: r297869
If the VOP_SETATTR() call that saves the exclusive create verifier failed,
the NFS server would leave the newly created vnode locked. This could
result in a file system that would not unmount and processes wedged,
waiting for the file to be unlocked.
Since this VOP_SETATTR() never fails for most file systems, this bug
doesn't normally manifest itself. I found it during testing of an
exported GlusterFS file system, which can fail.
This patch adds the vput() and changes the error to the correct NFS one.

292223 14-Dec-2015 rmacklem

MFC: r291527
Add kernel support to the NFS server for the "-manage-gids"
option that will be added to the nfsuserd daemon in a future
commit. It modifies the cache used by NFSv4 for name<-->id
translation (both username/uid and group/gid) to support this.
When "-manage-gids" is set, the server looks up each uid
for the RPC and uses the list of groups cached in the server
instead of the list of groups provided in the RPC request.
The cached group list is acquired for the cache by the nfsuserd
daemon via getgrouplist(3).
This avoids the 16 groups limit for the list in the RPC request.
Since the cache is now used for every RPC when "-manage-gids"
is enabled, the code also modifies the cache to use a separate
mutex for each hash list instead of a single global mutex.

291869 05-Dec-2015 rmacklem

MFC: r291150
When the nfsd threads are terminated, the NFSv4 server state
(opens, locks, etc) is retained, which I believe is correct behaviour.
However, for NFSv4.1, the server also retained a reference to the xprt
(RPC transport socket structure) for the backchannel. This caused
svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
a socket upcall to occur after the mutexes in the svcpool were destroyed,
causing a crash.
This patch fixes the code so that the backchannel xprt structure is
dereferenced just before svcpool_destroy() is called, so the code
does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.

287267 28-Aug-2015 rmacklem

MFC: r286790
For the case where an NFSv4.1 ExchangeID operation has the client identifier
that already has a confirmed ClientID, the nfsrv_setclient() function would
not fill in the clientidp being returned. As such, the value of ClientID
returned would be whatever garbage was on the stack.
This patch fixes the problem by filling in these fields.

286164 01-Aug-2015 rmacklem

MFC: r286046
This patch fixes a problem where, if the NFSv4 server has a previous
unconfirmed clientid structure for the same client on the last hash list,
this old entry would not be removed/deleted. I do not think this bug would have
caused serious problems, since the new entry would have been before the old one
on the list. This old entry would have eventually been scavenged/removed.

284334 12-Jun-2015 rmacklem

MFC: r283753
Make the NFS server use shared vnode locks for a few cases
that are allowed by the VFS/VOP interface instead of using
exclusive locks.

284318 12-Jun-2015 rmacklem

Add TUNABLE_INT() macros so that the tunables MFC'd from
head as r284216 are set via /boot/loader.conf in stable/10.
This is a direct commit to stable/10 because TUNABLE_INT()
is deprecated in head.

284216 10-Jun-2015 rmacklem

MFC: r283635
Make the size of the hash tables used by the NFSv4 server tunable.
No appreciable change in performance was observed after increasing
the sizes of these tables and then testing with a single client.
However, there was an email that indicated high CPU overheads for
a heavily loaded NFSv4 and it is hoped that increasing the sizes
of the hash tables via these tunables might help.
The tables remain the same size by default.

284199 10-Jun-2015 kib

MFC r283600:
Perform SU cleanup in the AST handler. Do not sleep waiting for SU cleanup
while owning vnode lock.

On MFC, for KBI stability, td_su member was moved to the end of the
struct thread.

282340 02-May-2015 rmacklem

MFC: r281962
Fix the NFS server's handling of a bogus NFSv2 ROOT RPC.
The ROOT RPC is deprecated in the NFSv2 RFC, RFC-1094
and should never be used by a client.

282271 30-Apr-2015 rmacklem

MFC: r281628
mav@ has found that NFS servers exporting ZFS file systems
can perform better when using a 128K read/write data size.
This patch changes NFS_MAXDATA from 64K to 128K so that
clients can use 128K for NFS mounts to allow this.
The patch also renames NFS_MAXDATA to NFS_SRVMAXIO so
that it is clear that it applies to the NFS server side
only. It also avoids a name conflict with the NFS_MAXDATA
defined in rpcsvc/nfs_prot.h, that is used for userland RPC.

282270 30-Apr-2015 rmacklem

MFC: r281562
File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

280258 19-Mar-2015 rwatson

Merge r263233 from HEAD to stable/10:

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

Sponsored by: Google, Inc.


/freebsd-10-stable/sys/amd64/amd64/sys_machdep.c
/freebsd-10-stable/sys/amd64/linux32/linux32_machdep.c
/freebsd-10-stable/sys/arm/arm/sys_machdep.c
/freebsd-10-stable/sys/cam/ctl/ctl_frontend_iscsi.c
/freebsd-10-stable/sys/cddl/compat/opensolaris/sys/file.h
/freebsd-10-stable/sys/compat/freebsd32/freebsd32_capability.c
/freebsd-10-stable/sys/compat/freebsd32/freebsd32_ioctl.c
/freebsd-10-stable/sys/compat/freebsd32/freebsd32_misc.c
/freebsd-10-stable/sys/compat/linux/linux_file.c
/freebsd-10-stable/sys/compat/linux/linux_ioctl.c
/freebsd-10-stable/sys/compat/linux/linux_socket.c
/freebsd-10-stable/sys/compat/svr4/svr4_fcntl.c
/freebsd-10-stable/sys/compat/svr4/svr4_filio.c
/freebsd-10-stable/sys/compat/svr4/svr4_ioctl.c
/freebsd-10-stable/sys/compat/svr4/svr4_misc.c
/freebsd-10-stable/sys/compat/svr4/svr4_stream.c
/freebsd-10-stable/sys/dev/aac/aac_linux.c
/freebsd-10-stable/sys/dev/aacraid/aacraid_linux.c
/freebsd-10-stable/sys/dev/amr/amr_linux.c
/freebsd-10-stable/sys/dev/filemon/filemon.c
/freebsd-10-stable/sys/dev/hwpmc/hwpmc_logging.c
/freebsd-10-stable/sys/dev/ipmi/ipmi_linux.c
/freebsd-10-stable/sys/dev/iscsi/icl.c
/freebsd-10-stable/sys/dev/iscsi/icl_proxy.c
/freebsd-10-stable/sys/dev/iscsi_initiator/iscsi.c
/freebsd-10-stable/sys/dev/mfi/mfi_linux.c
/freebsd-10-stable/sys/dev/tdfx/tdfx_linux.c
/freebsd-10-stable/sys/fs/fdescfs/fdesc_vnops.c
/freebsd-10-stable/sys/fs/fuse/fuse_vfsops.c
/freebsd-10-stable/sys/fs/nfsclient/nfs_clport.c
nfs_nfsdport.c
/freebsd-10-stable/sys/i386/i386/sys_machdep.c
/freebsd-10-stable/sys/i386/ibcs2/ibcs2_fcntl.c
/freebsd-10-stable/sys/i386/ibcs2/ibcs2_ioctl.c
/freebsd-10-stable/sys/i386/ibcs2/ibcs2_misc.c
/freebsd-10-stable/sys/i386/linux/linux_machdep.c
/freebsd-10-stable/sys/kern/imgact_elf.c
/freebsd-10-stable/sys/kern/kern_descrip.c
/freebsd-10-stable/sys/kern/kern_event.c
/freebsd-10-stable/sys/kern/kern_exec.c
/freebsd-10-stable/sys/kern/kern_exit.c
/freebsd-10-stable/sys/kern/kern_ktrace.c
/freebsd-10-stable/sys/kern/kern_sig.c
/freebsd-10-stable/sys/kern/kern_sysctl.c
/freebsd-10-stable/sys/kern/subr_capability.c
/freebsd-10-stable/sys/kern/subr_syscall.c
/freebsd-10-stable/sys/kern/subr_trap.c
/freebsd-10-stable/sys/kern/sys_capability.c
/freebsd-10-stable/sys/kern/sys_generic.c
/freebsd-10-stable/sys/kern/sys_procdesc.c
/freebsd-10-stable/sys/kern/tty.c
/freebsd-10-stable/sys/kern/uipc_mqueue.c
/freebsd-10-stable/sys/kern/uipc_sem.c
/freebsd-10-stable/sys/kern/uipc_shm.c
/freebsd-10-stable/sys/kern/uipc_syscalls.c
/freebsd-10-stable/sys/kern/uipc_usrreq.c
/freebsd-10-stable/sys/kern/vfs_acl.c
/freebsd-10-stable/sys/kern/vfs_aio.c
/freebsd-10-stable/sys/kern/vfs_extattr.c
/freebsd-10-stable/sys/kern/vfs_lookup.c
/freebsd-10-stable/sys/kern/vfs_syscalls.c
/freebsd-10-stable/sys/netsmb/smb_dev.c
/freebsd-10-stable/sys/nfsserver/nfs_srvkrpc.c
/freebsd-10-stable/sys/security/mac/mac_syscalls.c
/freebsd-10-stable/sys/sparc64/sparc64/sys_machdep.c
/freebsd-10-stable/sys/ufs/ffs/ffs_alloc.c
/freebsd-10-stable/sys/vm/vm_mmap.c
276500 01-Jan-2015 kib

MFC r275897:
Set NOCACHE flag for CREATE namei() calls, do not specially handle
MAKEENTRY in VOP_LOOKUP().

276437 31-Dec-2014 rmacklem

MFC: r276193
A deadlock in the NFSv4 server with vfs.nfsd.enable_locallocks=1
was reported via email. This was caused by a LOR between the
sleep lock used to serialize the local locking (nfsrv_locklf())
and locking the vnode. I believe this patch fixes the problem
by delaying relocking of the vnode until the sleep lock is
unlocked (nfsrv_unlocklf()). To avoid nfsvno_advlock() having the side
effect of unlocking the vnode, unlocking the vnode was moved to before
the functions that call nfsvno_advlock().
It shouldn't affect the execution of the default case where
vfs.nfsd.enable_locallocks=0.

274027 03-Nov-2014 kib

MFC r273727:
Original commit message was
Allow the vfs.nfsd knobs to be set from loader.conf (or using
kenv(8)). This is useful when nfsd is loaded as module.

As I understand, automatic fetch from kenv does not work in stable/10.
Merge the change still, to reduce code difference.

273877 31-Oct-2014 araujo

MFC r273159:
Add two sysctl(8) to enable/disable NFSv4 server to check when setting
user nobody and/or setting group nogroup as owner of a file or directory.
Usually at the client side, if there is an username that is not in the
client's passwd database, some clients will send 'nobody@<your.dns.domain>'
in the wire and the NFSv4 server will treat it as an ERROR.
However, if you have a valid user nobody in your passwd database,
the NFSv4 server will treat it as a NFSERR_BADOWNER as its believes the
client doesn't has the username mapped.

Submitted by: Loic Blot <loic.blot@unix-experience.fr>
Reviewed by: rmacklem
Approved by: rmacklem
Sponsored by: QNAP Systems Inc.

273736 27-Oct-2014 hselasky

MFC r263710, r273377, r273378, r273423 and r273455:

- De-vnet hash sizes and hash masks.
- Fix multiple issues related to arguments passed to SYSCTL macros.

Sponsored by: Mellanox Technologies


/freebsd-10-stable/sys/amd64/amd64/fpu.c
/freebsd-10-stable/sys/arm/arm/busdma_machdep-v6.c
/freebsd-10-stable/sys/arm/arm/busdma_machdep.c
/freebsd-10-stable/sys/cam/scsi/scsi_sa.c
/freebsd-10-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
/freebsd-10-stable/sys/cddl/dev/dtrace/dtrace_sysctl.c
/freebsd-10-stable/sys/compat/ndis/kern_ndis.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_asus.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_asus_wmi.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_hp.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_ibm.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_rapidstart.c
/freebsd-10-stable/sys/dev/acpi_support/acpi_sony.c
/freebsd-10-stable/sys/dev/bxe/bxe.c
/freebsd-10-stable/sys/dev/cxgb/cxgb_sge.c
/freebsd-10-stable/sys/dev/cxgbe/t4_main.c
/freebsd-10-stable/sys/dev/e1000/if_em.c
/freebsd-10-stable/sys/dev/e1000/if_igb.c
/freebsd-10-stable/sys/dev/e1000/if_lem.c
/freebsd-10-stable/sys/dev/hatm/if_hatm.c
/freebsd-10-stable/sys/dev/ixgbe/ixgbe.c
/freebsd-10-stable/sys/dev/ixgbe/ixv.c
/freebsd-10-stable/sys/dev/ixl/if_ixl.c
/freebsd-10-stable/sys/dev/mpr/mpr.c
/freebsd-10-stable/sys/dev/mps/mps.c
/freebsd-10-stable/sys/dev/mrsas/mrsas.c
/freebsd-10-stable/sys/dev/mrsas/mrsas.h
/freebsd-10-stable/sys/dev/mxge/if_mxge.c
/freebsd-10-stable/sys/dev/oce/oce_sysctl.c
/freebsd-10-stable/sys/dev/qlxgb/qla_os.c
/freebsd-10-stable/sys/dev/qlxgbe/ql_os.c
/freebsd-10-stable/sys/dev/rt/if_rt.c
/freebsd-10-stable/sys/dev/sound/pci/hda/hdaa.c
/freebsd-10-stable/sys/dev/vxge/vxge.c
/freebsd-10-stable/sys/dev/xen/netfront/netfront.c
/freebsd-10-stable/sys/fs/devfs/devfs_devs.c
/freebsd-10-stable/sys/fs/fuse/fuse_main.c
/freebsd-10-stable/sys/fs/fuse/fuse_vfsops.c
nfs_nfsdkrpc.c
/freebsd-10-stable/sys/geom/geom_kern.c
/freebsd-10-stable/sys/kern/kern_cpuset.c
/freebsd-10-stable/sys/kern/kern_descrip.c
/freebsd-10-stable/sys/kern/kern_mib.c
/freebsd-10-stable/sys/kern/kern_synch.c
/freebsd-10-stable/sys/kern/subr_devstat.c
/freebsd-10-stable/sys/kern/subr_kdb.c
/freebsd-10-stable/sys/kern/subr_uio.c
/freebsd-10-stable/sys/kern/vfs_cache.c
/freebsd-10-stable/sys/mips/mips/busdma_machdep.c
/freebsd-10-stable/sys/net/if_lagg.c
/freebsd-10-stable/sys/net/pfvar.h
/freebsd-10-stable/sys/net80211/ieee80211_ht.c
/freebsd-10-stable/sys/net80211/ieee80211_hwmp.c
/freebsd-10-stable/sys/net80211/ieee80211_mesh.c
/freebsd-10-stable/sys/net80211/ieee80211_superg.c
/freebsd-10-stable/sys/netgraph/bluetooth/common/ng_bluetooth.c
/freebsd-10-stable/sys/netgraph/ng_base.c
/freebsd-10-stable/sys/netgraph/ng_socket.c
/freebsd-10-stable/sys/netinet/cc/cc_chd.c
/freebsd-10-stable/sys/netinet/tcp_reass.c
/freebsd-10-stable/sys/netipsec/ipsec.h
/freebsd-10-stable/sys/netipx/ipx_proto.c
/freebsd-10-stable/sys/netpfil/pf/if_pfsync.c
/freebsd-10-stable/sys/netpfil/pf/pf.c
/freebsd-10-stable/sys/netpfil/pf/pf_ioctl.c
/freebsd-10-stable/sys/ofed/drivers/net/mlx4/mlx4_en.h
/freebsd-10-stable/sys/powerpc/powermac/fcu.c
/freebsd-10-stable/sys/powerpc/powermac/smu.c
/freebsd-10-stable/sys/powerpc/powerpc/busdma_machdep.c
/freebsd-10-stable/sys/powerpc/powerpc/cpu.c
/freebsd-10-stable/sys/sys/sysctl.h
/freebsd-10-stable/sys/vm/memguard.c
/freebsd-10-stable/sys/vm/vm_kern.c
/freebsd-10-stable/sys/x86/x86/busdma_bounce.c
270066 16-Aug-2014 rmacklem

MFC: r269771
Change the NFS server's printf related to hitting
the DRC cache's flood level so that it suggests
increasing vfs.nfsd.tcphighwater.

269655 07-Aug-2014 kib

MFC r269347:
Do not generate 1000 unique lock names for nfsrc hash chain locks.
Shorten the names of some nfs mutexes.

269452 03-Aug-2014 rmacklem

MFC: r268273
The new NFSv3 server did not generate directory postop attributes for
the reply to ReaddirPlus when the server failed within the loop
that calls VFS_VGET(). This failure is most likely an error
return from VFS_VGET() caused by a bogus d_fileno that was
truncated to 32bits.
This patch fixes the server so that it will return directory postop
attributes for the failure. It does not fix the underlying issue caused
by d_fileno being uint32_t when a file system like ZFS generates
a fileno that is greater than 32bits.

269398 01-Aug-2014 rmacklem

MFC: r268115
Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

268961 21-Jul-2014 bdrewery

MFC r268114:

Change NFS readdir() to only ignore cookies preceding the given offset for
UFS rather than for all but ZFS.

267343 10-Jun-2014 rmacklem

MFC: r267191
The new NFS server would not allow a hard link to be
created to a symlink. This restriction (which was
inherited from OpenBSD) is not required by the NFS RFCs.
Since this is allowed by the old NFS server, it is a
POLA violation to not allow it. This patch modifies the
new NFS server to allow this.

265714 08-May-2014 rmacklem

MFC: r265252
The new draft specification for NFSv4.0 specifies that a server
should either accept owner and owner_group strings that are just
the digits of the uid/gid or return NFS4ERR_BADOWNER.
This patch adds a sysctl vfs.nfsd.enable_stringtouid, which can
be set to enable the server w.r.t. accepting numeric string. It
also ensures that NFS4ERR_BADOWNER is returned if numeric uid/gid
strings are not enabled. This fixes the server for recent Linux
nfs4 clients that use numeric uid/gid strings by default.

265667 08-May-2014 rmacklem

MFC: r264888
The PR reported that the old NFS server did not set uio_td == NULL
for the VOP_READ() call. This patch fixes both the old and new
server for this case.

265621 07-May-2014 rmacklem

MFC: r264845
Remove an unnecessary level of indirection for an argument.
This simplifies the code and should avoid the clang sparc
port from generating an abort() call.

264266 08-Apr-2014 delphij

Fix NFS deadlock vulnerability. [SA-14:05]

Fix "Heartbleed" vulnerability and ECDSA Cache Side-channel
Attack in OpenSSL. [SA-14:06]

261055 22-Jan-2014 mav

MFC r260229, r260258, r260367, r260390, r260459, r260648:
Rework NFS Duplicate Request Cache cleanup logic.

- Introduce additional hash to group requests by hash of sockref. This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
- Indroduce additional callbacks to notify application layer about sockets
disconnection. Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
- Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache. This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
- Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.

Together this limits NFS DRC growth only to situations of real connectivity
problems. If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load. Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.

As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by: iXsystems, Inc.

261051 22-Jan-2014 mav

MFC r259877:
Slightly simplify expiration logic introduced in r254337.

- Do not update the histogram for items we are any way deleting from cache.
- Do not update the histogram if nfsrc_tcphighwater is not set.
- Remove some extra math operations.

261049 22-Jan-2014 mav

MFC r259765:
Fix RPC server threads file handle affinity to work better with ZFS.

Instead of taking 8 specific bytes of file handle to identify file during
RPC thread affitinity handling, use trivial hash of the full file handle.
ZFS's struct zfid_short does not have padding field after the length field,
as result, originally picked 8 bytes are loosing lower 16 bits of object ID,
causing many false matches and unneeded requests affinity to same thread.
This fix substantially improves NFS server latency and scalability in SPEC
NFS benchmark by more flexible use of multiple NFS threads.

260159 01-Jan-2014 rmacklem

MFC: r259854
The NFSv4 server would call VOP_SETATTR() with a shared locked vnode
when a Getattr for a file is done by a client other than the one that
holds the file's delegation. This would only happen when delegations
are enabled and the problem is fixed by this patch.

260144 31-Dec-2013 rmacklem

MFC: r259845
An intermittent problem with NFSv4 exporting of ZFS snapshots was
reported to the freebsd-fs mailing list. I believe the problem was
caused by the Readdir operation using VFS_VGET() for a snapshot file entry
instead of VOP_LOOKUP(). This would not occur for NFSv3, since it
will do a VFS_VGET() of "." which fails with ENOTSUPP at the beginning
of the directory, whereas NFSv4 does not check "." or "..". This
patch adds a call to VFS_VGET() for the directory being read to check
for ENOTSUPP.
I also observed that the mount_on_fileid and fsid attributes were
not correct at the snapshot's auto mountpoints when looking at packet
traces for the Readdir. This patch fixes the attributes by doing a check
for different v_mount structure, even if the vnode v_mountedhere is not
set.

256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


254337 14-Aug-2013 rmacklem

Fix several performance related issues in the new NFS server's
DRC for NFS over TCP.
- Increase the size of the hash tables.
- Create a separate mutex for each hash list of the TCP hash table.
- Single thread the code that deletes stale cache entries.
- Add a tunable called vfs.nfsd.tcphighwater, which can be increased
to allow the cache to grow larger, avoiding the overhead of frequent
scans to delete stale cache entries.
(The default value will result in frequent scans to delete stale cache
entries, analagous to what the pre-patched code does.)
- Add a tunable called vfs.nfsd.cachetcp that can be used to disable
DRC caching for NFS over TCP, since the old NFS server didn't DRC cache TCP.
It also adjusts the size of nfsrc_floodlevel dynamically, so that it is
always greater than vfs.nfsd.tcphighwater.

For UDP the algorithm remains the same as the pre-patched code, but the
tunable vfs.nfsd.udphighwater can be used to allow the cache to grow
larger and reduce the overhead caused by frequent scans for stale entries.
UDP also uses a larger hash table size than the pre-patched code.

Reported by: wollman
Tested by: wollman (earlier version of patch)
Submitted by: ivoras (earlier patch)
Reviewed by: jhb (earlier version of patch)
MFC after: 1 month


251171 31-May-2013 jeff

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


250657 15-May-2013 des

Fix typo in comment.

Submitted by: Alex Weber <alexwebr@gmail.com>
MFC after: 1 week


250055 29-Apr-2013 des

Fix a bug that allows NFS clients to issue READDIR on files.

PR: kern/178016
Security: CVE-2013-3266
Security: FreeBSD-SA-13:05.nfsserver


249596 17-Apr-2013 ken

Move the NFS FHA (File Handle Affinity) code from sys/nfsserver to
sys/nfs, since it is now shared by the two NFS servers.

Suggested by: rmacklem
Sponsored by: Spectra Logic
MFC after: 2 weeks


249592 17-Apr-2013 ken

Revamp the old NFS server's File Handle Affinity (FHA) code so that
it will work with either the old or new server.

The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes). It does not currently work for NFSv4 requests. They are
more complex, and will take more work to support.

This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately. Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads. This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.

The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.

This improves sequential write performance in ZFS, because writes
to a file are now more ordered. Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS. Sending them down the same thread increases
the odds of their being in order.

In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once. ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking. This change does not affect the
NFSv4 code.

This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.

I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.

The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables. The read offset window calculation has been slightly
modified as well. Instead of having separate bins, each file
handle has a rolling window of bin_shift size. This minimizes
glitches in throughput when shifting from one bin to another.

sys/conf/files:
Add nfs_fha_new.c and nfs_fha_old.c. Compile nfs_fha.c
when either the old or the new NFS server is built.

sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
Bring in changes from Rick Macklem to newnfs_realign that
allow it to operate in blocking (M_WAITOK) or non-blocking
(M_NOWAIT) mode.

sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
Bring in a change from Rick Macklem to allow telling
nfsm_dissect() whether or not to wait for mallocs.

sys/fs/nfs/nfsm_subs.h:
Bring in changes from Rick Macklem to create a new
nfsm_dissect_nonblock() inline function and
NFSM_DISSECT_NONBLOCK() macro.

sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
Add the malloc wait flag to a newnfs_realign() call.

sys/fs/nfsserver/nfs_nfsdkrpc.c:
Setup the new NFS server's RPC thread pool so that it will
call the FHA code.

Add the malloc flag argument to newnfs_realign().

Unstaticize newnfs_nfsv3_procid[] so that we can use it in
the FHA code.

sys/fs/nfsserver/nfs_nfsdsocket.c:
In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
that use the LK_SHARED lock type.

sys/fs/nfsserver/nfs_nfsdport.c:
In nfsd_fhtovp(), if we're starting a write, check to see
whether the underlying filesystem supports shared writes.
If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.

sys/nfsserver/nfs_fha.c:
Remove all code that is specific to the NFS server
implementation. Anything that is server-specific is now
accessed through a callback supplied by that server's FHA
shim in the new softc.

There are now separate sysctls and tunables for the FHA
implementations for the old and new NFS servers. The new
NFS server has its tunables under vfs.nfsd.fha, the old
NFS server's tunables are under vfs.nfsrv.fha as before.

In fha_extract_info(), use callouts for all server-specific
code. Getting file handles and offsets is now done in the
individual server's shim module.

In fha_hash_entry_choose_thread(), change the way we decide
whether two reads are in proximity to each other.
Previously, the calculation was a simple shift operation to
see whether the offsets were in the same power of 2 bucket.
The issue was that there would be a bucket (and therefore
thread) transition, even if the reads were in close
proximity. When there is a thread transition, reads wind
up going somewhat out of order, and ZFS gets confused.

The new calculation simply tries to see whether the offsets
are within 1 << bin_shift of each other. If they are, the
reads will be sent to the same thread.

The effect of this change is that for sequential reads, if
the client doesn't exceed the max_reqs_per_nfsd parameter
and the bin_shift is set to a reasonable value (22, or
4MB works well in my tests), the reads in any sequential
stream will largely be confined to a single thread.

Change fha_assign() so that it takes a softc argument. It
is now called from the individual server's shim code, which
will pass in the softc.

Change fhe_stats_sysctl() so that it takes a softc
parameter. It is now called from the individual server's
shim code. Add the current offset to the list of things
printed out about each active thread.

Change the num_reads and num_writes counters in the
fha_hash_entry structure to 32-bit values, and rename them
num_rw and num_exclusive, respectively, to reflect their
changed usage.

Add an enable sysctl and tunable that allows the user to
disable the FHA code (when vfs.XXX.fha.enable = 0). This
is useful for before/after performance comparisons.

nfs_fha.h:
Move most structure definitions out of nfs_fha.c and into
the header file, so that the individual server shims can
see them.

Change the default bin_shift to 22 (4MB) instead of 18
(256K). Allow unlimited commands per thread.

sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
Add shims for the old and new NFS servers to interface with
the FHA code, and callbacks for the

The shims contain all of the code and definitions that are
specific to the NFS servers.

They setup the server-specific callbacks and set the server
name for the sysctl and loader tunable variables.

sys/nfsserver/nfs_srvkrpc.c:
Configure the RPC code to call fhaold_assign() instead of
fha_assign().

sys/modules/nfsd/Makefile:
Add nfs_fha.c and nfs_fha_new.c.

sys/modules/nfsserver/Makefile:
Add nfs_fha_old.c.

Reviewed by: rmacklem
Sponsored by: Spectra Logic
MFC after: 2 weeks


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


247602 02-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


245909 25-Jan-2013 jhb

Further cleanups to use of timestamps in NFS:
- Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds
portion of wall-time stamps to manage timeouts on events.
- Remove unused nd_starttime from the per-request structure in the new
NFS server.
- Use nanotime() for the modification time on a delegation to get as
precise a time as possible.
- Use time_second instead of extracting the second from a call to
getmicrotime().

Submitted by: bde (3)
Reviewed by: bde, rmacklem
MFC after: 2 weeks


245613 18-Jan-2013 delphij

Make it possible to force async at server side on new NFS server, similar
to the old one's nfs.nfsrv.async.

Please note that by enabling this option (default is disabled), the system
could potentionally have silent data corruption if the server crashes
before write is committed to non-volatile storage, as the client side have
no way to tell if the data is already written.

Submitted by: rmacklem
MFC after: 2 weeks


245611 18-Jan-2013 jhb

Use vfs_timestamp() to set file timestamps rather than invoking
getmicrotime() or getnanotime() directly in NFS.

Reviewed by: rmacklem, bde
MFC after: 1 week


244042 08-Dec-2012 rmacklem

Move the NFSv4.1 client patches over from projects/nfsv4.1-client
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.


243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


241561 14-Oct-2012 rmacklem

Add two new options to the nfssvc(2) syscall that allow
processes running as root to suspend/resume execution
of the kernel nfsd threads. An earlier version of this
patch was tested by Vincent Hoffman (vince at unsane.co.uk)
and John Hickey (jh at deterlab.net).

Reviewed by: kib
MFC after: 2 weeks


241025 28-Sep-2012 kib

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


240720 20-Sep-2012 rmacklem

Modify the NFSv4 client so that it can handle owner
and owner_group strings that consist entirely of
digits, interpreting them as the uid/gid number.
This change was needed since new (>= 3.3) Linux
servers reply with these strings by default.
This change is mandated by the rfc3530bis draft.
Reported on freebsd-stable@ under the Subject
heading "Problem with Linux >= 3.3 as NFSv4 server"
by Norbert Aschendorff on Aug. 20, 2012.

Tested by: norbert.aschendorff at yahoo.de
Reviewed by: jhb
MFC after: 2 weeks


235381 12-May-2012 rmacklem

Fix two cases in the new NFS server where a tsleep() is
used, when the code should actually protect the tested
variable with a mutex. Since the tsleep()s had a 10sec
timeout, the race would have only delayed the allocation
of a new clientid for a client. The sleeps will also
rarely occur, since having a callback in progress when
a client acquires a new clientid, is unlikely.
in practice, since having a callback in progress when
a fresh clientid is being acquired by a client is unlikely.

MFC after: 1 month


235136 08-May-2012 jwd

Use the common api helper routine instead of freeing the namei
buffer directly.

Approved by: rmacklem (mentor)
MFC after: 1 month


234740 27-Apr-2012 rmacklem

Fix a leak of namei lookup path buffers that occurs when a
ZFS volume is exported via the new NFS server. The leak occurred
because the new NFS server code didn't handle the case where
a file system sets the SAVENAME flag in its VOP_LOOKUP() and
ZFS does this for the DELETE case.

Tested by: Oliver Brandmueller (ob at gruft.de), hrs
PR: kern/167266
MFC after: 1 month


234482 20-Apr-2012 mckusick

This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


232467 03-Mar-2012 rmacklem

The name caching changes of r230394 exposed an intermittent bug
in the new NFS server for NFSv4, where it would report ENOENT
when the file actually existed on the server. This turned out
to be caused by not initializing ni_topdir before calling lookup()
and there was a rare case where the value on the stack location
assigned to ni_topdir happened to be a pointer to a ".." entry,
such that "dp == ndp->ni_topdir" succeeded in lookup().
This patch initializes ni_topdir to fix the problem.

MFC after: 5 days


232050 23-Feb-2012 rmacklem

hrs@ reported a panic to freebsd-stable@ under the subject line
"panic in 8.3-PRERELEASE" on Feb. 22, 2012. This panic was caused
by use of a mix of tsleep() and msleep() calls on the same event
in the new NFS server DRC code. It did "mtx_unlock(); tsleep();"
in two places, which kib@ noted introduced a slight risk that the
wakeup() would occur before the tsleep(), resulting in a 10sec
delay before waking up. This patch fixes the problem by replacing
"mtx_unlock(); tsleep();" with mtx_sleep(..PDROP..). It also
changes a nfsmsleep() call to mtx_sleep() so that the code uses
mtx_sleep() consistently within the file.

Tested by: hrs (in progress)
Reviewed by: jhb
MFC after: 5 days


231949 21-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


231805 16-Feb-2012 rmacklem

Delete a couple of out of date comments that are no longer true in
the new NFS client.

Requested by: bde
MFC after: 1 week


230100 14-Jan-2012 rmacklem

Tai Horgan reported via email that there were two places in
the new NFSv4 server where the code follows the wrong list.
Fortunately, for these fairly rare cases, the lc_stateid[]
lists are normally empty. This patch fixes the code to
follow the correct list.

Reported by: tai.horgan at isilon.com
Discussed with: zack
MFC after: 2 weeks


228560 16-Dec-2011 rmacklem

Patch the new NFS server in a manner analagous to r228520 for the
old NFS server, so that it correctly handles a count == 0 argument
for Commit.

PR: kern/118126
MFC after: 2 weeks


228260 04-Dec-2011 rmacklem

This patch adds a sysctl to the NFSv4 server which optionally disables the
check for a UTF-8 compliant file name. Enabling this sysctl results in
an NFSv4 server that is non-RFC3530 compliant, therefore it is not enabled
by default. However, enabling this sysctl results in NFSv3 compatible
behaviour and fixes the problem reported by "dan at sunsaturn.com"
to freebsd-current@ on Nov. 14, 2011 under the subject "NFSV4 readlink_stat".

Tested by: dan at sunsaturn.com
Reviewed by: zack
MFC after: 2 weeks


228185 01-Dec-2011 jhb

Enhance the sequential access heuristic used to perform readahead in the
NFS server and reuse it for writes as well to allow writes to the backing
store to be clustered.
- Use a prime number for the size of the heuristic table (1017 is not
prime).
- Move the logic to locate a heuristic entry from the table and compute
the sequential count out of VOP_READ() and into a separate routine.
- Use the logic from sequential_heuristic() in vfs_vnops.c to update the
seqcount when a sequential access is performed rather than just
increasing seqcount by 1. This lets the clustering count ramp up
faster.
- Allow for some reordering of RPCs and if it is detected leave the current
seqcount as-is rather than dropping back to a seqcount of 1. Also,
when out of order access is encountered, cut seqcount in half rather than
dropping it all the way back to 1 to further aid with reordering.
- Fix the new NFS server to properly update the next offset after a
successful VOP_READ() so that the readahead actually works.

Some of these changes came from an earlier patch by Bjorn Gronwall that was
forwarded to me by bde@.

Discussed with: bde, rmacklem, fs@
Submitted by: Bjorn Gronwall (1, 4)
MFC after: 2 weeks


227809 22-Nov-2011 rmacklem

This patch enables the new/default NFS server's use of shared
vnode locking for read, readdir, readlink, getattr and access.
It is hoped that this will improve server performance for these
operations, since they will no longer be serialized for a given
file/vnode.


225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


225356 03-Sep-2011 rmacklem

Fix the NFS servers so that they can do a Lookup of "..",
which requires that ni_strictrelative be set to 0, post-r224810.

Tested by: swills (earlier version), geo dot liaskos at gmail.com
Approved by: re (kib)


225049 20-Aug-2011 rmacklem

Fix the NFSv4 server so that it returns NFSERR_SYMLINK when
an attempt to do an Open operation on any type of file other
than VREG is done. A recent discussion on the IETF working group's
mailing list (nfsv4@ietf.org) decided that NFSERR_SYMLINK
should be returned for all non-regular files and not just symlinks,
so that the Linux client would work correctly.
This change does not affect the FreeBSD NFSv4 client and is not
believed to have a negative effect on other NFSv4 clients.

Reviewed by: zkirsch
Approved by: re (kib)
MFC after: 2 weeks


224911 16-Aug-2011 jonathan

Fix a merge conflict.

r224086 added "goto out"-style error handling to nfssvc_nfsd(), in order
to reliably call NFSEXITCODE() before returning. Our Capsicum changes,
based on the old "return (error)" model, did not merge nicely.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


224637 03-Aug-2011 zack

Fix an NFS server issue where it was not correctly setting the eof flag when a
READ had hit the end of the file. Also, clean up some cruft in the code.

Approved by: re (kib)
Reviewed by: rmacklem
MFC after: 2 weeks


224554 31-Jul-2011 rmacklem

Fix rename in the new NFS server so that it does not require a
recursive vnode lock on the directory for the case where the
new file name is in the same directory as the old one. The patch
handles this as a special case, recognized by the new directory
having the same file handle as the old one and just VREF()s the old
dir vnode for this case, instead of doing a second VFS_FHTOVP() to get it.
This is required so that the server will work for file systems like
msdosfs, that do not support recursive vnode locking.
This problem was discovered during recent testing by pho@
when exporting an msdosfs file system via the new NFS server.

Tested by: pho
Reviewed by: zkirsch
Approved by: re (kib)
MFC after: 2 weeks


224086 16-Jul-2011 zack

Add DEXITCODE plumbing to NFS.

Isilon has the concept of an in-memory exit-code ring that saves the last exit
code of a function and allows for stack tracing. This is very helpful when
debugging tough issues.

This patch is essentially a no-op for BSD at this point, until we upstream
the dexitcode logic itself. The patch adds DEXITCODE calls to every NFS
function that returns an errno error code. A number of code paths were also
reorganized to have single exit paths, to reduce code duplication.

Submitted by: David Kwan <dkwan@isilon.com>
Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224083 16-Jul-2011 zack

Simple find/replace of VOP_ISLOCKED -> NFSVOPISLOCKED. This is done so that NFSVOPISLOCKED can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224082 16-Jul-2011 zack

Simple find/replace of VOP_UNLOCK -> NFSVOPUNLOCK. This is done so that NFSVOPUNLOCK can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224081 16-Jul-2011 zack

Simple find/replace of vn_lock -> NFSVOPLOCK. This is done so that NFSVOPLOCK can be modified later to add enhanced logging and assertions.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224080 16-Jul-2011 zack

Remove unnecessary thread pointer from VOPLOCK macros and current users.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224078 16-Jul-2011 zack

Move nfsvno_pathconf to be accessible to sys/fs/nfs; no functionality change.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


224077 16-Jul-2011 zack

Small acl patch to return the aclerror that comes back from nfsrv_dissectacl(). This fixes a problem where ATTRNOTSUPP was being returned instead of BADOWNER.

Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks


223373 21-Jun-2011 rmacklem

Fix the new NFSv4 server so that it checks for VREAD_ACL when
a client does a Getattr for an ACL and not VREAD_ATTRIBUTES.
This was found during the recent NFSv4 interoperability Bakeathon.

MFC after: 2 weeks


223349 20-Jun-2011 rmacklem

Fix the new NFSv4 server so that it only allows Lookup of
directories and symbolic links when traversing non-exported
file systems. Found during the recent NFSv4 interoperability
Bakeathon.

MFC after: 2 weeks


223348 20-Jun-2011 rmacklem

Fix the new NFSv4 server so that it allows Access and Readlink
operations while traversing non-exported file systems. This is
required for some non-FreeBSD clients to do NFSv4 mounts. Found during
the recent NFSv4 interoperability Bakeathon.

MFC after: 2 weeks


223312 19-Jun-2011 rmacklem

Fix a number of places where the new NFS server did not
lock the mutex when manipulating rc_flag in the DRC cache.
This is believed to fix a hung server that was reported
to the freebsd-fs@ list on June 9 under the subject heading
"New NFS server stress test hang", where all the threads
were waiting for the RC_LOCKED flag to clear.

Tested by: jwd at slowblink.com
MFC after: 2 weeks


223309 19-Jun-2011 rmacklem

Fix the kgssapi so that it can be loaded as a module. Currently
the NFS subsystems use five of the rpcsec_gss/kgssapi entry points,
but since it was not obvious which others might be useful, all
nineteen were included. Basically the nineteen entry points are
set in a structure called rpc_gss_entries and inline functions
defined in sys/rpc/rpcsec_gss.h check for the entry points being
non-NULL and then call them. A default value is returned otherwise.
Requested by rwatson.

Reviewed by: jhb
MFC after: 2 weeks


222663 04-Jun-2011 rmacklem

Modify the new NFS server so that the NFSv3 Pathconf RPC
doesn't return an error when the underlying file system
lacks support for any of the four _PC_xxx values used, by
falling back to default values.

Tested by: avg
MFC after: 2 weeks


222389 27-May-2011 rmacklem

Fix the new NFS client so that it handles NFSv4 state
correctly during a forced dismount. This required that
the exclusive and shared (refcnt) sleep lock functions check
for MNTK_UMOUNTF before sleeping, so that they won't block
while nfscl_umount() is getting rid of the state. As
such, a "struct mount *" argument was added to the locking
functions. I believe the only remaining case where a forced
dismount can get hung in the kernel is when a thread is
already attempting to do a TCP connect to a dead server
when the krpc client structure called nr_client is NULL.
This will only happen just after a "mount -u" with options
that force a new TCP connection is done, so it shouldn't
be a problem in practice.

MFC after: 2 weeks


222167 22-May-2011 rmacklem

Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by: kib


221615 08-May-2011 rmacklem

Change the new NFS server so that it uses vfs.nfsd naming
for its sysctls instead of vfs.newnfs. This separates the
names from the ones used by the client.


221517 06-May-2011 rmacklem

Change the new NFS server so that it returns 0 when the f_bavail
or f_ffree fields of "struct statfs" are negative, since the
values that go on the wire are unsigned and will appear to be
very large positive values otherwise. This makes the handling
of a negative f_bavail compatible with the old/regular NFS server.

MFC after: 2 weeks


220648 14-Apr-2011 rmacklem

Fix the experimental NFSv4 server so that it uses VOP_PATHCONF()
to determine if a file system supports NFSv4 ACLs. Since
VOP_PATHCONF() must be called with a locked vnode, the function
is called before nfsvno_fillattr() and the result is passed in
as an extra argument.

MFC after: 2 weeks


220645 14-Apr-2011 rmacklem

Modify the experimental NFSv4 server so that it handles
crossing of server mount points properly. The functions
nfsvno_fillattr() and nfsv4_fillattr() were modified to
take the extra arguments that are the mount point, a flag
to indicate that it is a file system root and the mounted
on fileno. The mount point argument needs to be busy when
nfsvno_fillattr() is called, since the vp argument is not
locked.

Reviewed by: kib
MFC after: 2 weeks


220546 11-Apr-2011 rmacklem

Vrele ni_startdir in the experimental NFS server for the case
of NFSv2 getting an error return from VOP_MKNOD(). Without this
patch, the server file system remains busy after an NFSv2
VOP_MKNOD() fails.

MFC after: 2 weeks


220530 10-Apr-2011 rmacklem

Add some cleanup code to the module unload operation for
the experimental NFS server, so that it doesn't leak memory
when unloaded. However, unloading the NFSv4 server is not
recommended, since all NFSv4 state will be lost by the unload
and clients will have to recover the state after a server
reload/restart as if the server crashed/rebooted.

MFC after: 2 weeks


220507 10-Apr-2011 rmacklem

Add a VOP_UNLOCK() for the directory, when that is not what
VOP_LOOKUP() returned. This fixes a bug in the experimental
NFS server for the case where VFS_VGET() fails returning EOPNOTSUPP
in the ReaddirPlus RPC, forcing the use of VOP_LOOKUP() instead.

MFC after: 2 weeks


219028 25-Feb-2011 netchild

Add some FEATURE macros for various features (AUDIT/CAM/IPC/KTR/MAC/NFS/NTP/
PMC/SYSV/...).

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: arch@ (parts by rwatson, trasz, jhb)
X-MFC after: to be determined in last commit with code from this project


218345 05-Feb-2011 alc

Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean(). They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by: kib
MFC after: 3 weeks


217432 14-Jan-2011 rmacklem

Modify the experimental NFSv4 server so that it posts a SIGUSR2
signal to the master nfsd daemon whenever the stable restart
file has been modified. This will allow the master nfsd daemon
to maintain an up to date backup copy of the file. This is
enabled via the nfssvc() syscall, so that older nfsd daemons
will not be signaled.

Reviewed by: jhb
MFC after: 1 week


217336 12-Jan-2011 zack

In the experimental NFS server, when converting an open-owner to a lock-owner,
start at sequence id 1 instead of 0, to match up with both Solaris and Linux.

Reviewed by: rmacklem
Approved by: zml (mentor)


217335 12-Jan-2011 zack

Clean up the experimental NFS server replay cache when the module is unloaded.

Reviewed by: rmacklem
Approved by: zml (mentor)


217176 09-Jan-2011 rmacklem

Modify readdirplus in the experimental NFS server in a
manner analogous to r216633 for the regular server. This
change busies the file system so that VFS_VGET() is
guaranteed to be using the correct mount point even
during a forced dismount attempt. Since nfsd_fhtovp() is
not called immediately before readdirplus, the patch is
actually a clone of pjd@'s nfs_serv.c.4.patch instead of
the one committed in r216633.

Reviewed by: kib
MFC after: 10 days


217066 06-Jan-2011 rmacklem

Delete the NFS_STARTWRITE() and NFS_ENDWRITE() macros that
obscured vn_start_write() and vn_finished_write() for the
old OpenBSD port, since most uses have been replaced by the
correct calls.

MFC after: 12 days


217063 06-Jan-2011 rmacklem

Since the VFS_LOCK_GIANT() code in the experimental NFS
server is broken and the major file systems are now all
mpsafe, modify the server so that it will only export
mpsafe file systems. This was discussed on freebsd-fs@
and removes a fair bit of crufty code.

MFC after: 12 days


217023 05-Jan-2011 rmacklem

Modify the experimental NFS server so that it calls
vn_start_write() with a non-NULL vp. That way it will
find the correct mount point mp and use that mp for the
subsequent vn_finished_write() call. Also, it should fail
without crashing if the mount point is being forced dismounted
because vn_start_write() will set the mp NULL via VOP_GETWRITEMOUNT().

Reviewed by: kib
MFC after: 12 days


217017 05-Jan-2011 rmacklem

Fix the experimental NFS server to use vfs_busyfs() instead
of vfs_getvfs() so that the mount point is busied for the
VFS_FHTOVP() call. This is analagous to r185432 for the
regular NFS server.

Reviewed by: kib
MFC after: 12 days


216931 03-Jan-2011 rmacklem

Fix the nlm so that it no longer depends on the regular
nfs client and, as such, can be loaded for the experimental
nfs client without the regular client.

Reviewed by: jhb
MFC after: 2 weeks


216898 03-Jan-2011 rmacklem

Fix the experimental NFS server so that it doesn't leak
a reference count on the directory when creating device
special files.

MFC after: 2 weeks


216897 03-Jan-2011 rmacklem

Modify the experimental NFSv4 server so that the lookup
ops return a locked vnode. This ensures that the associated mount
point will always be valid for the code that follows the operation.
Also add a couple of additional checks
for non-error to the other functions that create file objects.

MFC after: 2 weeks


216894 02-Jan-2011 rmacklem

Delete some cruft from the experimental NFS server that was
only used by the OpenBSD port for its pseudo-fs.

MFC after: 2 weeks


216893 02-Jan-2011 rmacklem

Add checks for VI_DOOMED and vn_lock() failures to the
experimental NFS server, to handle the case where an
exported file system is forced dismounted while an RPC
is in progress. Further commits will fix the cases where
a mount point is used when the associated vnode isn't locked.

Reviewed by: kib
MFC after: 2 weeks


216875 01-Jan-2011 rmacklem

Add support for shared vnode locks for the Read operation
in the experimental NFSv4 server.

Reviewed by: kib
MFC after: 2 weeks


216784 28-Dec-2010 rmacklem

Delete the nfsvno_localconflict() function in the experimental
NFS server since it is no longer used and is broken.

MFC after: 2 weeks


216700 25-Dec-2010 rmacklem

Modify the experimental NFS server so that it uses LK_SHARED
for RPC operations when it can. Since VFS_FHTOVP() currently
always gets an exclusively locked vnode and is usually called
at the beginning of each RPC, the RPCs for a given vnode will
still be serialized. As such, passing a lock type argument to
VFS_FHTOVP() would be preferable to doing the vn_lock() with
LK_DOWNGRADE after the VFS_FHTOVP() call.

Reviewed by: kib
MFC after: 2 weeks


216693 24-Dec-2010 rmacklem

Add an argument to nfsvno_getattr() in the experimental
NFS server, so that it can avoid calling VOP_ISLOCKED()
when the vnode is known to be locked. This will allow
LK_SHARED to be used for these cases, which happen to
be all the cases that can use LK_SHARED. This does not
fix any bug, but it reduces the number of calls to
VOP_ISLOCKED() and prepares the code so that it can be
switched to using LK_SHARED in a future patch.

Reviewed by: kib
MFC after: 2 weeks


216692 24-Dec-2010 rmacklem

Simplify vnode locking in the expeimental NFS server's
readdir functions. In particular, get rid of two bogus
VOP_ISLOCKED() calls. Removing the VOP_ISLOCKED() calls
is the only actual bug fixed by this patch.

Reviewed by: kib
MFC after: 2 weeks


216691 24-Dec-2010 rmacklem

Since VOP_READDIR() for ZFS does not return monotonically
increasing directory offset cookies, disable the UFS related
loop that skips over directory entries at the beginning of
the block for the experimental NFS server. This loop is
required for UFS since it always returns directory entries
starting at the beginning of the block that
the requested directory offset is in. In discussion with pjd@
and mckusick@ it seems that this behaviour of UFS should maybe
change, with this fix being an interim patch until then.
This patch only fixes the experimental server, since pjd@ is
working on a patch for the regular server.

Discussed with: pjd, mckusick
MFC after: 5 days


216510 17-Dec-2010 rmacklem

Fix two vnode locking problems in nfsd_recalldelegation() in the
experimental NFSv4 server. The first was a bogus use of VOP_ISLOCKED()
in a KASSERT() and the second was the need to lock the vnode for the
nfsrv_checkremove() call. Also, delete a "__unused" that was bogus,
since the argument is used.

Reviewed by: zack.kirsch at isilon.com
MFC after: 2 weeks


216330 09-Dec-2010 rmacklem

Disable attempts to establish a callback connection from the
experimental NFSv4 server to a NFSv4 client when delegations are not
being issued, even if the client advertises a callback path.
This avoids a problem where a Linux client advertises a
callback path that doesn't work, due to a firewall, and then
times out an Open attempt before the FreeBSD server gives up
its callback connection attempt. (Suggested by
drb at karlov.mff.cuni.cz to fix the Linux client problem that
he reported on the fs-stable mailing list.)
The server should probably have
a 1sec timeout on callback connection attempts when there are
no delegations issued to the client, but that patch will require
changes to the krpc and this serves as a work around until then.

Tested by: drb at karlov.mff.cuni.cz
MFC after: 5 days


214255 23-Oct-2010 rmacklem

Modify the experimental NFSv4 server's file handle hash function
to use the generic hash32_buf() function. Although adding the
bytes seemed sufficient for UFS and ZFS, since most of the bytes
are the same for file handles on the same volume, this might not
be sufficient for other file systems. Use of a generic function
also seems preferable to one specific to NFSv4.

Suggested by: gleb.kurtsou at gmail.com
MFC after: 10 days


214224 22-Oct-2010 rmacklem

Modify the file handle hash function in the experimental NFS
server so that it will work better for non-UFS file systems.
The new function simply sums the bytes of the fh_fid field
of fhandle_t.

MFC after: 10 days


214149 21-Oct-2010 rmacklem

Modify the experimental NFS server in a manner analagous to
r214049 for the regular NFS server, so that it will not do
a VOP_LOOKUP() of ".." when at the root of a file system
when performing a ReaddirPlus RPC.

MFC after: 10 days


213712 11-Oct-2010 rmacklem

Try and make the nfsrv_localunlock() function in the experimental
NFSv4 server more readable. Mostly changes to comments, but a
case of >= is changed to >, since == can never happen. Also, I've
added a couple of KASSERT()s and a slight optimization, since
once the "else if" case happens, subsequent locks in the list can't
have any effect. None of these changes fixes any known bug.

MFC after: 2 weeks


212834 19-Sep-2010 rmacklem

Fix nfsrv_freeallnfslocks() in the experimental NFSv4 server so that
it frees local locks correctly upon close. In order for
nfsrv_localunlock() to work correctly, the lock can no longer be in
the lockowner's stateid list. As such, nfsrv_freenfslock() has to
be called before nfsrv_localunlock(), to get rid of the lock structure
on the lockowner's stateid list. This only affected operation when
local locks (vfs.newnfs.enable_locallocks=1) are enabled, which is
not the default at this time.

MFC after: 1 week


212833 19-Sep-2010 rmacklem

Fix the experimental NFSv4 server so that it performs local VOP_ADVLOCK()
unlock operations correctly. It was passing in F_SETLK instead of
F_UNLCK as the operation for the unlock case. This only affected
operation when local locking (vfs.newnfs.enable_locallocks=1) was enabled.

MFC after: 1 week


212443 10-Sep-2010 rmacklem

This patch applies one of the two fixes suggested by
zack.kirsch at isilon.com for a race between nfsrv_freeopen()
and nfsrv_getlockfile() in the experimental NFS server that
he found during testing. Although nfsrv_freeopen() holds a
sleep lock on the lock file structure when called with
cansleep != 0, nfsrv_getlockfile() could still search the
list, once it acquired the NFSLOCKSTATE() mutex. I believe
that acquiring the mutex in nfsrv_freeopen() fixes the race.

MFC after: 2 weeks


211953 28-Aug-2010 rmacklem

Add acquisition of a reference count on nfsv4root_lock to the
nfsd_recalldelegation() function, since this function is called
by nfsd threads when they are handling NFSv2 or NFSv3 RPCs, where
no reference count would have been acquired.

MFC after: 2 weeks


211951 28-Aug-2010 rmacklem

The timer routine in the experimental NFS server did not acquire
the correct mutex when checking nfsv4root_lock. Although this
could be fixed by adding mutex lock/unlock calls, zack.kirsch at
isilon.com suggested a better fix that uses a non-blocking
acquisition of a reference count on nfsv4root_lock. This fix
allows the weird NFSLOCKSTATE(); NFSUNLOCKSTATE(); synchronization
to be deleted. This patch applies this fix.

Tested by: zack.kirsch at isilon.com
MFC after: 2 weeks


210178 16-Jul-2010 rmacklem

Patch the experimental NFSv4 server so that it acquires a reference
count on nfsv4rootfs_lock when dumping state, since these functions
are not called by nfsd threads. Without this reference count, it
is possible for an nfsd thread to acquire an exclusive lock on
nfsv4rootfs_lock while the dump is in progress and then change the
lists, potentially causing a crash.

Reported by: zack.kirsch at isilon.com
MFC after: 2 weeks


210154 16-Jul-2010 rmacklem

Delete comments related to soft clock interrupts that don't apply
to the FreeBSD port of the experimental NFSv4 server.

Submitted by: zack.kirsch at isilon.com
MFC after: 2 weeks


210102 15-Jul-2010 rmacklem

This patch fixes a bug in the experimental NFSv4 server where it
released a reference count on nfsv4rootfs_lock erroneously when
administrative revocation of state was done.

Submitted by: zack.kirsch at isilon.com
MFC after: 2 weeks


210030 13-Jul-2010 rmacklem

Fix a bogus comment that mentions lru lists that don't exist.

Reported by: zack.kirsch at isilon.com
MFC after: 2 weeks


209191 15-Jun-2010 rmacklem

Add MODULE_DEPEND() macros to the experimental NFS client and
server so that the modules will load when kernels are built with
none of the NFS* configuration options specified. I believe this
resolves the problems reported by PR kern/144458 and the email on
freebsd-stable@ posted by Dmitry Pryanishnikov on June 13.

Tested by: kib
PR: kern/144458
Reviewed by: kib
MFC after: 1 week


209120 13-Jun-2010 kib

In NFS clients, instead of inconsistently using #ifdef
DIAGNOSTIC and #ifndef DIAGNOSTIC for debug assertions, prefer
KASSERT(). Also change one #ifdef DIAGNOSTIC in the new nfs server.

Submitted by: Mikolaj Golub <to.my.trociny gmail com>
MFC after: 2 weeks


207170 24-Apr-2010 rmacklem

An NFSv4 server will reply NFSERR_GRACE for non-recovery RPCs
during the grace period after startup. This grace period must
be at least the lease duration, which is typically 1-2 minutes.
It seems prudent for the experimental NFS client to wait a few
seconds before retrying such an RPC, so that the server isn't
flooded with non-recovery RPCs during recovery. This patch adds
an argument to nfs_catnap() to implement a 5 second delay
for this case.

MFC after: 1 week


206236 06-Apr-2010 rmacklem

Harden the experimental NFS server a little, by adding range
checks on the length of the client's open/lock owner name. Also,
add free()'s for one case where they were missing and would
have caused a leak if NFSERR_BADXDR had been replied. Probably
never happens, but the leak is now plugged, just in case.

MFC after: 2 weeks


206170 04-Apr-2010 rmacklem

Harden the experimental NFS server a little, by adding extra checks
in the readdir functions for non-positive byte count arguments.
For the negative case, set it to the maximum allowable, since it
was actually a large positive value (unsigned) on the wire.
Also, fix up the readdir function comment a bit.

Suggested by: dillon AT apollo.backplane.com
MFC after: 2 weeks


206063 02-Apr-2010 rmacklem

For the experimental NFS server, add a call to free the lookup
path buffer for one case where it was missing when doing mkdir.
This could have conceivably resulted in a leak of a buffer, but
a leak was never observed during testing, so I suspect it would
have occurred rarely, if ever, in practice.

MFC after: 2 weeks


206061 02-Apr-2010 rmacklem

Add SAVENAME to the cn_flags for all cases in the experimental
NFS server for the CREATE cn_nameiop where SAVESTART isn't set.
I was not aware that this needed to be done by the caller until
recently.

Tested by: lampa AT fit.vutbr.cz (link case)
Submitted by: lampa AT fit.vutbr.cz (link case)
MFC after: 2 weeks


205941 30-Mar-2010 rmacklem

This patch should fix handling of byte range locks locally
on the server for the experimental nfs server. When enabled
by setting vfs.newnfs.locallocks_enable to non-zero, the
experimental nfs server will now acquire byte range locks
on the file on behalf of NFSv4 clients, such that lock
conflicts between the NFSv4 clients and processes running
locally on the server, will be recognized and handled correctly.

MFC after: 2 weeks


205663 26-Mar-2010 rmacklem

Patch the experimental NFS server in a manner analagous to r205661
for the regular NFS server, to ensure that ESTALE is
returned to the client for all errors returned by VFS_FHTOVP().

MFC after: 2 weeks


205010 11-Mar-2010 rwatson

Update nfsrv_getsocksndseq() for changes in TCP internals since FreeBSD 6.x:

- so_pcb is now guaranteed to be non-NULL and valid if a valid socket
reference is held.

- Need to check INP_TIMEWAIT and INP_DROPPED before assuming inp_ppcb is a
tcpcb, as it might be a tcptw or NULL otherwise.

- tp can never be NULL by the end of the function, so only check
TCPS_ESTABLISHED before extracting tcpcb fields.

The NFS server arguably incorporates too many assumptions about TCP
internals, but fixing that is left for nother day.

MFC after: 1 week
Reviewed by: bz
Reviewed and tested by: rmacklem
Sponsored by: Juniper Networks


203849 14-Feb-2010 rmacklem

Change the default value for vfs.newnfs.enable_locallocks to 0 for
the experimental NFS server, since local locking is known to be
broken and the patch to fix it is still a work in progress.

MFC after: 5 days


203848 14-Feb-2010 rmacklem

This fixes the experimental NFS server so that it won't crash in the
caching code for IPv6 by fixing a typo that used the incorrect variable.
It also fixes the indentation of the statement above it.

Reported by: simon AT comsys.ntu-kpi.kiev.ua
MFC after: 5 days


201442 03-Jan-2010 rmacklem

The test for "same client" for the experimental nfs server over NFSv4
was broken w.r.t. byte range lock conflicts when it was the same client
and the request used the open_to_lock_owner4 case, since lckstp->ls_clp
was not set. This patch fixes it by using "clp" instead of "lckstp->ls_clp".

MFC after: 2 weeks


200999 25-Dec-2009 rmacklem

Modify the experimental server so that it uses VOP_ACCESSX().
This is necessary in order to enable NFSv4 ACL support. The
argument to nfsvno_accchk() was changed to an accmode_t and
the function nfsrv_aclaccess() was no longer needed and,
therefore, deleted.

Reviewed by: trasz
MFC after: 2 weeks


200287 08-Dec-2009 delphij

Allow using IPv6 in nfsrvd_sentcache() callback.

PR: kern/141289
Submitted by: Petr Lampa <lampa fit vutbr cz>
Approved by: rmacklem
MFC after: 1 week


199715 23-Nov-2009 rmacklem

Modify the experimental nfs server so that it falls back to
using VOP_LOOKUP() when VFS_VGET() returns EOPNOTSUPP in the
ReaddirPlus RPC. This patch is based upon one by pjd@ for the
regular nfs server which has not yet been committed. It is needed
when a ZFS volume is exported and ReaddirPlus (which almost
always happens for NFSv4) is performed by a client. The patch
also simplifies vnode lock handling somewhat.

MFC after: 2 weeks


199616 20-Nov-2009 rmacklem

Patch the experimental NFS server is a manner analagous to
r197525, so that the creation verifier is handled correctly
in va_atime for 64bit architectures. There were two problems.
One was that the code incorrectly assumed that
sizeof (struct timespec) == 8 and the other was that the tv_sec
field needs to be assigned from a signed 32bit integer, so that
sign extension occurs on 64bit architectures. This is required
for correct operation when exporting ZFS volumes.

Reviewed by: pjd
MFC after: 2 weeks


195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


194498 19-Jun-2009 brooks

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867


194408 17-Jun-2009 rmacklem

Add the SVC_RELEASE(xprt), as required by r194407.

Approved by: kib (mentor)


194292 16-Jun-2009 rmacklem

Remove the "int *" typecast for the aresid argument to vn_rdwr()
and change the type of the argument from size_t to int. This
should avoid issues on 64bit architectures.

Suggested by: kib
Approved by: kib (mentor)


193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


193162 31-May-2009 zec

Unbreak options VIMAGE kernel builds.

Approved by: julian (mentor)


192861 26-May-2009 rmacklem

Fix the experimental nfs subsystem so that it builds with the
current NFSv4 ACLs, as defined in sys/acl.h. It still needs a
way to test a mount point for NFSv4 ACL support before it will
work. Until then, the NFSHASNFS4ACL() macro just always returns 0.

Approved by: kib (mentor)


192782 26-May-2009 rmacklem

Add two sysctl variables to the experimental nfs server, so
that the range of versions of NFS handled by the server can
be limited. The nfsd daemon must be restarted after these
sysctl variables are changed, in order for the change to take
effect.

Approved by: kib (mentor)


192781 26-May-2009 rmacklem

Fix the handling of NFSv4 Illegal Operation number to conform
to RFC3530 (the operation number in the reply must be set to
the value for OP_ILLEGAL). Also cleaned up some indentation.

Approved by: kib (mentor)


192780 26-May-2009 rmacklem

Fix the experimental nfs server's interface to the new krpc so
that it handles the case of a non-exported NFSv4 root correctly.
Also, delete handling for the case where nd_repstat is already
set in nfs_proc(), since that no longer happens.

Approved by: kib (mentor)


192707 25-May-2009 rmacklem

Add NFSv4 root export checks to the DelegPurge, Renew and
ReleaseLockOwner operations analagous to what is already
in place for SetClientID and SetClientIDConfirm. These are
the five NFSv4 operations that do not use file handle(s),
so the checks are done using the NFSv4 root export entries
in /etc/exports.

Approved by: kib (mentor)


192693 24-May-2009 rmacklem

Fix the experimental NFSv4 server so that it handles the case
where a client is not allowed NFSv4 access correctly. This
restriction is specified in the "V4: ..." line(s) in
/etc/exports.

Approved by: kib (mentor)


192596 22-May-2009 rmacklem

Change the printf of r192595 to identify the function,
as requested by Sam.

Approved by: kib (mentor)


192591 22-May-2009 rmacklem

Modified the printf message of r192590 to remove the
possible DOS attack, as suggested by Sam.

Approved by: kib (mentor)


192589 22-May-2009 rmacklem

Change the comment at the beginning of the function to reflect the
change from panic() to printf() done by r192588.


192588 22-May-2009 rmacklem

Change the reboot panic that would have occurred if clientid
numbers wrapped around to a printf() warning of a possible
DOS attack, in the experimental nfsv4 server.

Approved by: kib (mentor)


192574 22-May-2009 rmacklem

Fix the experimental nfs server so that it depends on the nlm,
since it now calls nlm_acquire_next_sysid().

Approved by: kib (mentor)


192539 21-May-2009 rmacklem

Fix the comment at line 3711 to be consistent with the change
applied for r192537.

Approved by: kib (mentor)


192503 21-May-2009 rmacklem

Modify sys/fs/nfsserver/nfs_nfsdport.c to use nlm_acquire_next_sysid()
to set the l_sysid for locks correctly.

Approved by: kib (mentor)


192463 20-May-2009 rmacklem

Although it should never happen, all the nfsv4 server can do
when it runs out of clientids is reboot. I had replaced cpu_reboot()
with printf(), since cpu_reboot() doesn't exist for sparc64.
This change replaces the printf() with panic(), so the reboot
would occur for this highly unlikely occurrence.

Approved by: kib (mentor)


192256 17-May-2009 rmacklem

Fix the acquisition of local locks via VOP_ADVLOCK() by the
experimental nfsv4 server. It was setting the a_id argument
to a fixed value, but that wasn't sufficient for FreeBSD8.
Instead, set l_pid and l_sysid to 0 plus set the F_REMOTE
flag to indicate that these fields are used to check for
same lock owner. Since, for NFSv4, a lockowner is a ClientID plus
an up to 1024byte name, it can't be put in l_sysid easily.
I also renamed the p variable to td, since it's a thread ptr.

Approved by: kib (mentor)


192255 17-May-2009 rmacklem

Added a SYSCTL to sys/fs/nfsserver/nfs_nfsdport.c so that the value of
nfsrv_dolocallocks can be changed via sysctl. I also added some non-empty
descriptor strings and reformatted some overly long lines.

Approved by: kib (mentor)


192181 16-May-2009 rmacklem

Fixed the Null callback RPCs so that they work with the new krpc. This
required two changes: setting the program and version numbers before
connect and fixing the handling of the Null Rpc case in newnfs_request().

Approved by: kib (mentor)


192121 14-May-2009 rmacklem

Apply changes to the experimental nfs server so that it uses the security
flavors as exported in FreeBSD-CURRENT. This allows it to use a
slightly modified mountd.c instead of a different utility.

Approved by: kib (mentor)


192017 12-May-2009 rmacklem

Modify the experimental nfs server to use the new nfsd_nfsd_args
structure for nfsd. Includes a change that clarifies the use of
an empty principal name string to indicate AUTH_SYS only.

Approved by: kib (mentor)


192000 11-May-2009 rmacklem

Change the name of the nfs server addsock structure from nfsd_args
to nfsd_addsock_args, so that it is consistent with the one in
sys/nfsserver/nfs.h.

Approved by: kib (mentor)


191998 11-May-2009 rmacklem

Modify nfsvno_fhtovp() to ensure that it always sets the credp
argument. Returning without credp set could result in a caller
doing crfree() on garbage.

Reviewed by: kan
Approved by: kib (mentor)


191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


191940 09-May-2009 kan

Do not embed struct ucred into larger netcred parent structures.

Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by: John Hickey
PR: kern/133439
Reviewed by: dfr, kib


191783 04-May-2009 rmacklem

Add the experimental nfs subtree to the kernel, that includes
support for NFSv4 as well as NFSv2 and 3.
It lives in 3 subdirs under sys/fs:
nfs - functions that are common to the client and server
nfsclient - a mutation of sys/nfsclient that call generic functions
to do RPCs and handle state. As such, it retains the
buffer cache handling characteristics and vnode semantics that
are found in sys/nfsclient, for the most part.
nfsserver - the server. It includes a DRC designed specifically for
NFSv4, that is used instead of the generic DRC in sys/rpc.
The build glue will be checked in later, so at this point, it
consists of 3 new subdirs that should not affect kernel building.

Approved by: kib (mentor)