History log of /freebsd-10-stable/sys/rpc/svc.h
Revision Date Author Comments
# 314034 21-Feb-2017 avg

MFC r313735: add svcpool_close to handle killed nfsd threads

PR: 204340
Reported by: Panzura
Reviewed by: rmacklem
Approved by: rmacklem


# 290203 30-Oct-2015 wollman

Long-overdue MFC of r280930:

Fix overflow bugs in and remove obsolete limit from kernel RPC
implementation.

The kernel RPC code, which is responsible for the low-level scheduling
of incoming NFS requests, contains a throttling mechanism that
prevents too much kernel memory from being tied up by NFS requests
that are being serviced. When the throttle is engaged, the RPC layer
stops servicing incoming NFS sockets, resulting ultimately in
backpressure on the clients (if they're using TCP). However, this is
a very heavy-handed mechanism as it prevents all clients from making
any requests, regardless of how heavy or light they are. (Thus, when
engaged, the throttle often prevents clients from even mounting the
filesystem.) The throttle mechanism applies specifically to requests
that have been received by the RPC layer (from a TCP or UDP socket)
and are queued waiting to be serviced by one of the nfsd threads; it
does not limit the amount of backlog in the socket buffers.

The original implementation limited the total bytes of queued requests
to the minimum of a quarter of (nmbclusters * MCLBYTES) and 45 MiB.
The former limit seems reasonable, since requests queued in the socket
buffers and replies being constructed to the requests in progress will
all require some amount of network memory, but the 45 MiB limit is
plainly ridiculous for modern memory sizes: when running 256 service
threads on a busy server, 45 MiB would result in just a single
maximum-sized NFS3PROC_WRITE queued per thread before throttling.

Removing this limit exposed integer-overflow bugs in the original
computation, and related bugs in the routines that actually account
for the amount of traffic enqueued for service threads. The old
implementation also attempted to reduce accounting overhead by
batching updates until each queue is fully drained, but this is prone
to livelock, resulting in repeated accumulate-throttle-drain cycles on
a busy server. Various data types are changed to long or unsigned
long; explicit 64-bit types are not used due to the unavailability of
64-bit atomics on many 32-bit platforms, but those platforms also
cannot support nmbclusters large enough to cause overflow.
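
As a rough illustration of the limit computation described above, the following is a minimal sketch only, not the committed code. The function names, parameter names, and the saturating overflow guard are assumptions made for this example; the only facts taken from the log are the "quarter of (nmbclusters * MCLBYTES)" rule, the removed 45 MiB cap, and the use of unsigned long.

```c
#include <limits.h>

/* The cap that this revision removes (45 MiB), kept here for comparison. */
#define OLD_CAP_BYTES	(45UL * 1024 * 1024)

/*
 * Hypothetical helper: a quarter of the cluster memory, computed in
 * unsigned long. The guard saturates instead of wrapping if the product
 * would not fit (an assumption of this sketch, not taken from the log).
 */
unsigned long
new_queue_limit(unsigned long nclusters, unsigned long cluster_bytes)
{
	if (cluster_bytes != 0 && nclusters > ULONG_MAX / cluster_bytes)
		return (ULONG_MAX / 4);
	return (nclusters * cluster_bytes / 4);
}

/* Hypothetical helper: the old policy, min(quarter of memory, 45 MiB). */
unsigned long
old_queue_limit(unsigned long nclusters, unsigned long cluster_bytes)
{
	unsigned long quarter = new_queue_limit(nclusters, cluster_bytes);

	return (quarter < OLD_CAP_BYTES ? quarter : OLD_CAP_BYTES);
}
```

With one million 2048-byte clusters, the old policy in this sketch is pinned at the 45 MiB cap, while the new one allows a quarter of the cluster memory.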

This code (in a 10.1 kernel) is presently running on production NFS
servers at CSAIL.

Summary of this revision:
* Removes 45 MiB limit on requests queued for nfsd service threads
* Fixes integer-overflow and signedness bugs
* Avoids unnecessary throttling by not deferring accounting for
completed requests

Differential Revision: https://reviews.freebsd.org/D2165
Reviewed by: rmacklem, mav
Relnotes: yes
Sponsored by: MIT Computer Science & Artificial Intelligence Laboratory


# 269398 01-Aug-2014 rmacklem

MFC: r268115
Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.


# 267742 22-Jun-2014 mav

MFC r267228:
Split RPC pool threads into a number of smaller semi-isolated groups.

The old design with a unified thread pool was good from the point of view of
thread utilization, but the single pool-wide mutex became a huge congestion
point on systems with many CPUs. To reduce the congestion, create several
thread groups within a pool (one group for every 6 CPUs and 12 threads),
each group with its own mutex. Each connection is assigned to one of the
groups in round-robin fashion during its registration. File affinity
code may still move requests between the groups, but otherwise the groups
are self-contained.
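
The sizing rule and round-robin assignment described above might be sketched as follows. This is an illustration under stated assumptions, not the committed code: the names group_count(), assign_group(), struct pool_sketch, and the SKETCH_MAXGROUPS cap are all hypothetical, and the real code would need an atomic counter rather than the plain increment used here.

```c
/* Hypothetical upper bound on the number of groups (assumption). */
#define SKETCH_MAXGROUPS	16

/* One group for every 6 CPUs and 12 threads, but always at least one. */
int
group_count(int maxthreads, int ncpus)
{
	int by_threads = maxthreads / 12;
	int by_cpus = ncpus / 6;
	int n = by_threads < by_cpus ? by_threads : by_cpus;

	if (n < 1)
		n = 1;
	if (n > SKETCH_MAXGROUPS)
		n = SKETCH_MAXGROUPS;
	return (n);
}

/* Minimal stand-in for the pool state needed by the assignment step. */
struct pool_sketch {
	int		groupcount;	/* result of group_count() */
	unsigned int	nextgroup;	/* round-robin cursor */
};

/* Pick the group for a newly registered connection, round-robin. */
int
assign_group(struct pool_sketch *p)
{
	return ((int)(p->nextgroup++ % (unsigned int)p->groupcount));
}
```

For example, under these assumptions a pool with 256 threads on a 24-CPU machine gets 4 groups, and successive connections are spread over groups 0, 1, 2, 3, 0, and so on.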


# 267741 22-Jun-2014 mav

MFC r267223:
Remove the st_idle variable, which duplicates st_xprt.


# 267740 22-Jun-2014 mav

MFC r267221, r267278:
Introduce a new per-thread lock to protect the list of requests.

This slightly simplifies the svc_run_internal() code: once we have processed
all the requests in a queue, we know that no new one will appear.


# 261055 22-Jan-2014 mav

MFC r260229, r260258, r260367, r260390, r260459, r260648:
Rework NFS Duplicate Request Cache cleanup logic.

- Introduce an additional hash to group requests by hash of sockref. This
allows TCP acknowledgements to be processed without looping through the whole
cache, and as a result allows doing it every time.
- Introduce additional callbacks to notify the application layer about socket
disconnection. Without this, the last few requests processed just before a
socket disconnection never had their ACKs processed and were stuck in the
cache for many hours.
- Implement a transport-specific method for tracking reply acknowledgements.
The new implementation does not cross multiple stack layers to get the data and
does not have the race conditions that previously left some requests stuck
in the cache. This could be done more efficiently at the sockbuf layer, but that
would break some KBIs, while I don't know of other consumers for it aside from NFS.
- Instead of traversing the whole DRC twice per request, run cleaning only once
per request and, except in some conditions, traverse only a single hash slot
at a time.

Together this limits NFS DRC growth to situations of real connectivity
problems. If the network is working well, so that all replies are acknowledged,
the cache remains almost empty even after hours of heavy load. Without this
change, on the same test the cache grew to many thousands of requests even
with a perfectly working local network.

As another result, this reduces the CPU time spent on DRC handling during
the SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by: iXsystems, Inc.


# 261054 22-Jan-2014 mav

MFC r260097:
Move most of the NFS file handle affinity code out of the heavily congested
global RPC thread pool lock and protect it with its own set of locks.

On synthetic benchmarks this improves peak NFS request rate by 40%.


# 261053 22-Jan-2014 mav

MFC r260036:
Introduce xprt_inactive_self() -- a variant for use when it is certain that
the port is assigned to the calling thread, for example within receive
handlers. In that case the function reduces to a single assignment and can
avoid locking.


# 261048 22-Jan-2014 mav

MFC r259659, r259662:
Remove several linear list traversals per request from RPC server code.

Do not insert active ports into the pool->sp_active list if they are
successfully assigned to some thread. This makes that list include only ports
that really require attention, so traversal can be reduced to simply taking
the first one.

Remove an idle thread from the pool->sp_idlethreads list when assigning some
work (port or requests) to it. That again makes it possible to replace list
traversals with simply taking the first element.
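
The idea of keeping only unassigned ports on the active list, so that picking work becomes a constant-time "take the first element", can be sketched as below. This is a hand-rolled illustration with hypothetical names (struct xprt_sketch, struct active_list, active_insert(), active_take_first()); it does not use the kernel's queue macros or locking and is not the committed code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a transport ("port") handle. */
struct xprt_sketch {
	struct xprt_sketch *next;
	int assigned;		/* nonzero once a thread owns this port */
};

/* FIFO list holding only ports that still need a thread. */
struct active_list {
	struct xprt_sketch *head;
	struct xprt_sketch **tailp;
};

void
active_init(struct active_list *l)
{
	l->head = NULL;
	l->tailp = &l->head;
}

/* Queue a port only if no thread already owns it. */
void
active_insert(struct active_list *l, struct xprt_sketch *x)
{
	if (x->assigned)
		return;
	x->next = NULL;
	*l->tailp = x;
	l->tailp = &x->next;
}

/* O(1): every queued port needs attention, so just take the head. */
struct xprt_sketch *
active_take_first(struct active_list *l)
{
	struct xprt_sketch *x = l->head;

	if (x == NULL)
		return (NULL);
	l->head = x->next;
	if (l->head == NULL)
		l->tailp = &l->head;
	x->next = NULL;
	x->assigned = 1;
	return (x);
}
```

Because already-assigned ports never enter the list, the consumer does not have to skip over entries, which is the traversal the commit message says was removed.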


# 261046 22-Jan-2014 mav

MFC r258578, r258580, r258581 (by hrs):
Replace Sun RPC license in TI-RPC library with a 3-clause BSD license
with the explicit permissions.

