Cross Reference: /linux-master/fs/nfsd/nfs4state.c

History log of /linux-master/fs/nfsd/nfs4state.c
Revision	Date	Author	Comments
# d43113fb	07-Apr-2024	NeilBrown <neilb@suse.de>	nfsd: optimise recalculate_deny_mode() for a common case recalculate_deny_mode() takes time that is linear in the number of stateids active on the file. When called from release_openowner -> free_ol_stateid_reaplist ->nfs4_free_ol_stateid -> release_all_access the number of times it is called is linear in the number of stateids. The net result is that time taken by release_openowner is quadratic in the number of stateids. When the nfsd server is shut down while there are many active stateids this can result in a soft lockup. ("CPU stuck for 302s" seen in one case). In many cases all the states have the same deny modes and there is no need to examine the entire list in recalculate_deny_mode(). In particular, recalculate_deny_mode() will only reduce the deny mode, never increase it. So if some prefix of the list causes the original deny mode to be required, there is no need to examine the remainder of the list. So we can improve recalculate_deny_mode() to usually run in constant time, so release_openowner will typically be only linear in the number of states. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 9320f27f	05-Apr-2024	Jeff Layton <jlayton@kernel.org>	nfsd: add tracepoint in mark_client_expired_locked Show client info alongside the number of cl_rpc_users. If that's elevated, then we can infer that this function returned nfserr_jukebox. [ cel: For additional debugging of RPC user refcounting ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 2d499011	05-Apr-2024	Chuck Lever <chuck.lever@oracle.com>	nfsd: new tracepoint for check_slot_seqid Replace a dprintk in check_slot_seqid with tracepoints. These new tracepoints track slot sequence numbers during operation. Suggested-by: Jeffrey Layton <jlayton@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 38f080f3	01-Apr-2024	Chuck Lever <chuck.lever@oracle.com>	NFSD: Move callback_wq into struct nfs4_client Commit 883820366747 ("nfsd: update workqueue creation") made the callback_wq single-threaded, presumably to protect modifications of cl_cb_client. See documenting comment for nfsd4_process_cb_update(). However, cl_cb_client is per-lease. There's no other reason that all callback operations need to be dispatched via a single thread. The single threading here means all client callbacks can be blocked by a problem with one client. Change the NFSv4 callback client so it serializes per-lease instead of serializing all NFSv4 callback operations on the server. Reported-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 56c35f43	07-Apr-2024	NeilBrown <neilb@suse.de>	nfsd: drop st_mutex before calling move_to_close_lru() move_to_close_lru() is currently called with ->st_mutex held. This can lead to a deadlock as move_to_close_lru() waits for sc_count to drop to 2, and some threads holding a reference might be waiting for the mutex. These references will never be dropped so sc_count will never reach 2. There can be no harm in dropping ->st_mutex before move_to_close_lru() because the only place that takes the mutex is nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called. See also https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/ where this problem was raised but not successfully resolved. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# eec76208	07-Apr-2024	NeilBrown <neilb@suse.de>	nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# b3f03739	07-Apr-2024	NeilBrown <neilb@suse.de>	nfsd: move nfsd4_cstate_assign_replay() earlier in open handling. Rather than taking the rp_mutex (via nfsd4_cstate_assign_replay) in nfsd4_cleanup_open_state() (which seems counter-intuitive), take it and assign rp_owner as soon as possible - in nfsd4_process_open1(). This will support a future change when nfsd4_cstate_assign_replay() might fail. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 23df1778	07-Apr-2024	NeilBrown <neilb@suse.de>	nfsd: perform all find_openstateowner_str calls in the one place. Currently find_openstateowner_str look ups are done both in nfsd4_process_open1() and alloc_init_open_stateowner() - the latter possibly being a surprise based on its name. It would be easier to follow, and more conformant to common patterns, if the lookup was all in the one place. So replace alloc_init_open_stateowner() with find_or_alloc_open_stateowner() and use the latter in nfsd4_process_open1() without any calls to find_openstateowner_str(). This means all finds are find_openstateowner_str_locked() and find_openstateowner_str() is no longer needed. So discard find_openstateowner_str() and rename find_openstateowner_str_locked() to find_openstateowner_str(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 10396f4d	05-Apr-2024	Jeff Layton <jlayton@kernel.org>	nfsd: hold a lighter-weight client reference over CB_RECALL_ANY Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the client. While a callback job is technically an RPC that counter is really more for client-driven RPCs, and this has the effect of preventing the client from being unhashed until the callback completes. If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we can end up in a situation where the callback can't complete on the (now dead) callback channel, but the new client can't connect because the old client can't be unhashed. This usually manifests as a NFS4ERR_DELAY return on the CREATE_SESSION operation. The job is only holding a reference to the client so it can clear a flag after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of reference when dealing with the nfsdfs info files, but it should work appropriately here to ensure that the nfs4_client doesn't disappear. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 99dc2ef0	26-Mar-2024	Chuck Lever <chuck.lever@oracle.com>	NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies There are one or two cases where CREATE_SESSION returns NFS4ERR_DELAY in order to force the client to wait a bit and try CREATE_SESSION again. However, after commit e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in the CREATE_SESSION slot. Thus, when the client resends the CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY response rather than actually executing the request and properly recording its outcome. This blocks the client from making further progress. RFC 8881 Section 15.1.1.3 says: > If NFS4ERR_DELAY is returned on an operation other than SEQUENCE > that validly appears as the first operation of a request ... [t]he > request can be retried in full without modification. In this case > as well, the replier MUST avoid returning a response containing > NFS4ERR_DELAY as the response to an initial operation of a request > solely on the basis of its presence in the reply cache. Neither the original NFSD code nor the discussion in section 18.36.4 refer explicitly to this important requirement, so I missed it. Note also that not only must the server not cache NFS4ERR_DELAY, but it has to not advance the CREATE_SESSION slot sequence number so that it can properly recognize and accept the client's retry. Reported-by: Dai Ngo <dai.ngo@oracle.com> Fixes: e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation") Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# bad4c585	03-Mar-2024	Dai Ngo <dai.ngo@oracle.com>	NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit The NFS server should ask clients to voluntarily return unused delegations when the number of granted delegations reaches the max_delegations. This is so that the server can continue to grant delegations for new requests. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 24d92de9	15-Feb-2024	Trond Myklebust <trond.myklebust@hammerspace.com>	nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr() The main point of the guarded SETATTR is to prevent races with other WRITE and SETATTR calls. That requires that the check of the guard time against the inode ctime be done after taking the inode lock. Furthermore, we need to take into account the 32-bit nature of timestamps in NFSv3, and the possibility that files may change at a faster rate than once a second. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 5826e09b	17-Feb-2024	Dai Ngo <dai.ngo@oracle.com>	NFSD: OP_CB_RECALL_ANY should recall both read and write delegations Add RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# c5967721	15-Feb-2024	Dai Ngo <dai.ngo@oracle.com>	NFSD: handle GETATTR conflict with write delegation If the GETATTR request on a file that has write delegation in effect and the request attributes include the change info and size attribute then the request is handled as below: Server sends CB_GETATTR to client to get the latest change info and file size. If these values are the same as the server's cached values then the GETATTR proceeds as normal. If either the change info or file size is different from the server's cached values, or the file was already marked as modified, then: . update time_modify and time_metadata into file's metadata with current time . encode GETATTR as normal except the file size is encoded with the value returned from CB_GETATTR . mark the file as modified If the CB_GETATTR fails for any reasons, the delegation is recalled and NFS4ERR_DELAY is returned for the GETATTR. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# b910544a	08-Feb-2024	Chuck Lever <chuck.lever@oracle.com>	NFSD: Document the phases of CREATE_SESSION As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be split into four phases. NFSD's implementation now does it like that description. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# e4469c6c	08-Feb-2024	Chuck Lever <chuck.lever@oracle.com>	NFSD: Fix the NFSv4.1 CREATE_SESSION operation RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1 CREATE_SESSION operation. The section defines four phases of operation. Phase 2 processes the CREATE_SESSION sequence ID. As a separate step, Phase 3 evaluates the CREATE_SESSION arguments. The problem we are concerned with is when phase 2 is successful but phase 3 fails. The spec language in this case is "No changes are made to any client records on the server." RFC 8881 Section 18.35.4 defines a "client record", and it does /not/ contain any details related to the special CREATE_SESSION slot. Therefore NFSD is incorrect to skip incrementing the CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of CREATE_SESSION processing fails. In other words, even though NFSD happens to store the cs_slot in a client record, in terms of the protocol the slot is logically separate from the client record. Three complications: 1. The world has moved on since commit 86c3e16cc7aa ("nfsd4: confirm only on succesful create_session") broke this. So we can't simply revert that commit. 2. NFSD's CREATE_SESSION implementation does not cleanly delineate the logic of phases 2 and 3. So this won't be a surgical fix. 3. Because of the way it currently handles the CREATE_SESSION slot sequence number, nfsd4_create_session() isn't caching error responses in the CREATE_SESSION slot. Instead of replaying the response cache in those cases, it's executing the transaction again. Reorganize the CREATE_SESSION slot sequence number accounting. This requires that error responses are appropriately cached in the CREATE_SESSION slot (once it is found). Reported-by: Connor Smith <connor.smith@hitachivantara.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 649e58d5	01-Feb-2024	Kunwu Chan <chentao@kylinos.cn>	nfsd: Simplify the allocation of slab caches in nfsd4_init_slabs Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Make the code cleaner and more readable. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 05eda6e7	30-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: don't call locks_release_private() twice concurrently It is possible for free_blocked_lock() to be called twice concurrently, once from nfsd4_lock() and once from nfsd4_release_lockowner() calling remove_blocked_locks(). This is why a kref was added. It is perfectly safe for locks_delete_block() and kref_put() to be called in parallel as they use locking or atomicity respectively as protection. However locks_release_private() has no locking. It is safe for it to be called twice sequentially, but not concurrently. This patch moves that call from free_blocked_lock() where it could race with itself, to free_nbl() where it cannot. This will slightly delay the freeing of private info or release of the owner - but not by much. It is arguably more natural for this freeing to happen in free_nbl() where the structure itself is freed. This bug was found by code inspection - it has not been seen in practice. Fixes: 47446d74f170 ("nfsd4: add refcount for nfsd4_blocked_lock") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 1e33e141	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow layout state to be admin-revoked. When there is layout state on a filesystem that is being "unlocked" that is now revoked, which involves closing the nfsd_file and releasing the vfs lease. To avoid races, ->ls_file can now be accessed either: - under ->fi_lock for the state's sc_file or - under rcu_read_lock() if nfsd_file_get() is used. To support this, ->fence_client and nfsd4_cb_layout_fail() now take a second argument being the nfsd_file. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 06efa667	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow delegation state ids to be revoked and then freed Revoking state through 'unlock_filesystem' now revokes any delegation states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. As there is already support for revoking delegations, we build on that for admin-revoking. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 39657c74	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow open state ids to be revoked and then freed Revoking state through 'unlock_filesystem' now revokes any open states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Possibly the related lock states should be revoked too, but a subsequent patch will do that for all lock state on the superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 1c13bf9f	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow lock state ids to be revoked and then freed Revoking state through 'unlock_filesystem' now revokes any lock states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# d688d858	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow admin-revoked NFSv4.0 state to be freed. For NFSv4.1 and later the client easily discovers if there is any admin-revoked state and will then find and explicitly free it. For NFSv4.0 there is no such mechanism. The client can only find that state is admin-revoked if it tries to use that state, and there is no way for it to explicitly free the state. So the server must hold on to the stateid (at least) for an indefinite amount of time. A RELEASE_LOCKOWNER request might justify forgetting some of these stateids, as would the whole clients lease lapsing, but these are not reliable. This patch takes two approaches. Whenever a client uses an revoked stateid, that stateid is then discarded and will not be recognised again. This might confuse a client which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at all, but should mostly work. Hopefully one error will lead to other resources being closed (e.g. process exits), which will result in more stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED. Also, any admin-revoked stateids that have been that way for more than one lease time are periodically revoke. No actual freeing of state happens in this patch. That will come in future patches which handle the different sorts of revoked state. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 11b2cfbf	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revoke Add "admin-revoked" to the status information for any states that have been admin-revoked. This can be useful for confirming correct behaviour. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 39e1be64	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/states Change the "show" functions to show some content even if a file cannot be found. This is the case for admin-revoked state. This is primarily useful for debugging - to ensure states are being removed eventually. So change several seq_printf() to seq_puts(). Some of these are needed to keep checkpatch happy. Others were done for consistency. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 1ac3629b	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: prepare for supporting admin-revocation of state The NFSv4 protocol allows state to be revoked by the admin and has error codes which allow this to be communicated to the client. This patch - introduces a new state-id status SC_STATUS_ADMIN_REVOKED which can be set on open, lock, or delegation state. - reports NFS4ERR_ADMIN_REVOKED when these are accessed - introduces a per-client counter of these states and returns SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero. Decrements this when freeing any admin-revoked state. - introduces stub code to find all interesting states for a given superblock so they can be revoked via the 'unlock_filesystem' file in /proc/fs/nfsd/ No actual states are handled yet. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 3f29cc82	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: split sc_status out of sc_type sc_type identifies the type of a state - open, lock, deleg, layout - and also the status of a state - closed or revoked. This is a bit untidy and could get worse when "admin-revoked" states are added. So clean it up. With this patch, the type is now all that is stored in sc_type. This is zero when the state is first added to ->cl_stateids (causing it to be ignored), and is then set appropriately once it is fully initialised. It is set under ->cl_lock to ensure atomicity w.r.t lookup. It is now never cleared. sc_type is still a bit-set even though at most one bit is set. This allows lookup functions to be given a bitmap of acceptable types. sc_type is now an unsigned short rather than char. There is no value in restricting to just 8 bits. All the constants now start SC_TYPE_ matching the field in which they are stored. Keeping the existing names and ensuring clear separation from non-type flags would have required something like NFS4_STID_TYPE_CLOSED which is cumbersome. The "NFS4" prefix is redundant was they only appear in NFS4 code, so remove that and change STID to SC to match the field. The status is stored in a separate unsigned short named "sc_status". It has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED. CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero. These flags are only ever set, never cleared. For deleg stateids they are set under the global state_lock. For open and lock stateids they are set under ->cl_lock. For layout stateids they are set under ->ls_lock nfs4_unhash_stid() has been removed, and we never set sc_type = 0. This was only used for LOCK and LAYOUT stids and they now use SC_STATUS_CLOSED. Also TRACE_DEFINE_NUM() calls for the various STID #define have been removed because these things are not enums, and so that call is incorrect. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 83e73316	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: avoid race after unhash_delegation_locked() NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in purpose. REVOKED is used for NFSv4.1 states which have been revoked because the lease has expired. CLOSED is used in other cases. The difference has two practical effects. 1/ REVOKED states are on the ->cl_revoked list 2/ REVOKED states result in nfserr_deleg_revoked from nfsd4_verify_open_stid() and nfsd4_validate_stateid while CLOSED states result in nfserr_bad_stid. Currently a state that is being revoked is first set to "CLOSED" in unhash_delegation_locked(), then possibly to "REVOKED" in revoke_delegation(), at which point it is added to the cl_revoked list. It is possible that a stateid test could see the CLOSED state which really should be REVOKED, and so return the wrong error code. So it is safest to remove this window of inconsistency. With this patch, unhash_delegation_locked() always sets the state correctly, and revoke_delegation() no longer changes the state. Also remove a redundant test on minorversion when NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion is non-zero. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# c6540026	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: don't call functions with side-effecting inside WARN_ON() Code like: WARN_ON(foo()) looks like an assertion and might not be expected to have any side effects. When testing if a function with side-effects fails a construct like if (foo()) WARN_ON(1); makes the intent more obvious. nfsd has several WARN_ON calls where the test has side effects, so it would be good to change them. These cases don't really need the WARN_ON. They have never failed in 8 years of usage so let's just remove the WARN_ON wrapper. Suggested-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 77945728	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: hold ->cl_lock for hash_delegation_locked() The protocol for creating a new state in nfsd is to allocate the state leaving it largely uninitialised, add that state to the ->cl_stateids idr so as to reserve a state-id, then complete initialisation of the state and only set ->sc_type to non-zero once the state is fully initialised. If a state is found in the idr with ->sc_type == 0, it is ignored. The ->cl_lock lock is used to avoid races - it is held while checking sc_type during lookup, and held when a non-zero value is stored in ->sc_type. ... except... hash_delegation_locked() finalises the initialisation of a delegation state, but does NOT hold ->cl_lock. So this patch takes ->cl_lock at the appropriate time w.r.t other locks, and so ensures there are no races (which are extremely unlikely in any case). As ->fi_lock is often taken when ->cl_lock is held, we need to take ->cl_lock first of those two. Currently ->cl_lock and state_lock are never both taken at the same time. We need both for this patch so an arbitrary choice is needed concerning which to take first. As state_lock is more global, it might be more contended, so take it first. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 6b4ca49d	29-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: remove stale comment in nfs4_show_deleg() As we do now support write delegations, this comment is unhelpful and misleading. Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# f52f1975	25-Jan-2024	Chuck Lever <chuck.lever@oracle.com>	NFSD: Add nfsd_seq4_status trace event Add a trace point that records SEQ4_STATUS flags returned in an NFSv4.1 SEQUENCE response. SEQ4_STATUS flags report backchannel issues and changes to lease state to clients. Knowing what the server is reporting to clients is useful for debugging both configuration and operational issues in real time. For example, upcoming patches will enable server administrators to revoke parts of a client's lease; that revocation is indicated to the client when a subsequent SEQUENCE operation has one or more SEQ4_STATUS flags that are set. Sample trace records: nfsd-927 [006] 615.581821: nfsd_seq4_status: xid=0x095ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-927 [006] 615.588043: nfsd_seq4_status: xid=0x0a5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-928 [003] 615.588448: nfsd_seq4_status: xid=0x0b5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 4b148854	26-Jan-2024	Josef Bacik <josef@toxicpanda.com>	nfsd: make all of the nfsd stats per-network namespace We have a global set of counters that we modify for all of the nfsd operations, but now that we're exposing these stats across all network namespaces we need to make the stats also be per-network namespace. We already have some caching stats that are per-network namespace, so move these definitions into the same counter and then adjust all the helpers and users of these stats to provide the appropriate nfsd_net struct so that the stats are maintained for the per-network namespace objects. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 7b800101	05-Feb-2024	Jeff Layton <jlayton@kernel.org>	filelock: don't do security checks on nfsd setlease calls Zdenek reported seeing some AVC denials due to nfsd trying to set delegations: type=AVC msg=audit(09.11.2023 09:03:46.411:496) : avc: denied { lease } for pid=5127 comm=rpc.nfsd capability=lease scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:nfsd_t:s0 tclass=capability permissive=0 When setting delegations on behalf of nfsd, we don't want to do all of the normal capabilty and LSM checks. nfsd is a kernel thread and runs with CAP_LEASE set, so the uid checks end up being a no-op in most cases anyway. Some nfsd functions can end up running in normal process context when tearing down the server. At that point, the CAP_LEASE check can fail and cause the client to not tear down delegations when expected. Also, the way the per-fs ->setlease handlers work today is a little convoluted. The non-trivial ones are wrappers around generic_setlease, so when they fail due to permission problems they usually they end up doing a little extra work only to determine that they can't set the lease anyway. It would be more efficient to do those checks earlier. Transplant the permission checking from generic_setlease to vfs_setlease, which will make the permission checking happen earlier on filesystems that have a ->setlease operation. Add a new kernel_setlease function that bypasses these checks, and switch nfsd to use that instead of vfs_setlease. There is one behavioral change here: prior this patch the setlease_notifier would fire even if the lease attempt was going to fail the security checks later. With this change, it doesn't fire until the caller has passed them. I think this is a desirable change overall. nfsd is the only user of the setlease_notifier and it doesn't benefit from being notified about failed attempts. Cc: Ondrej Mosnáček <omosnacek@gmail.com> Reported-by: Zdenek Pytela <zpytela@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2248830 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240205-bz2248830-v1-1-d0ec0daecba1@kernel.org Acked-by: Tom Talpey <tom@talpey.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
# c69ff407	31-Jan-2024	Jeff Layton <jlayton@kernel.org>	filelock: split leases out of struct file_lock Add a new struct file_lease and move the lease-specific fields from struct file_lock to it. Convert the appropriate API calls to take struct file_lease instead, and convert the callers to use them. There is zero overlap between the lock manager operations for file locks and the ones for file leases, so split the lease-related operations off into a new lease_manager_operations struct. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-47-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 05580bbf	31-Jan-2024	Jeff Layton <jlayton@kernel.org>	nfsd: adapt to breakup of struct file_lock Most of the existing APIs have remained the same, but subsystems that access file_lock fields directly need to reach into struct file_lock_core now. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-42-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 60f3154d	31-Jan-2024	Jeff Layton <jlayton@kernel.org>	nfsd: convert to using new filelock helpers Convert to using the new file locking helper functions. Also, in later patches we're going to introduce some macros with names that clash with the variable names in nfsd4_lock. Rename them. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-12-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 5ea9a7c5	04-Feb-2024	NeilBrown <neilb@suse.de>	nfsd: don't take fi_lock in nfsd_break_deleg_cb() A recent change to check_for_locks() changed it to take ->flc_lock while holding ->fi_lock. This creates a lock inversion (reported by lockdep) because there is a case where ->fi_lock is taken while holding ->flc_lock. ->flc_lock is held across ->fl_lmops callbacks, and nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However it doesn't need to. Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list and so needed the lock. Since then it doesn't walk the list and doesn't need the lock. Two actions are performed under the lock. One is to call nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on the nfs4_file at all, so don't need the lock. The other is to set ->fi_had_conflict which is in the nfs4_file. This field is only ever set here (except when initialised to false) so there is no possible problem will multiple threads racing when setting it. The field is tested twice in nfs4_set_delegation(). The first test does not hold a lock and is documented as an opportunistic optimisation, so it doesn't impose any need to hold ->fi_lock while setting ->fi_had_conflict. The second test in nfs4_set_delegation() is make under ->fi_lock, so removing the locking when ->fi_had_conflict is set could make a change. The change could only be interesting if ->fi_had_conflict tested as false even though nfsd_break_one_deleg() ran before ->fi_lock was unlocked. i.e. while hash_delegation_locked() was running. As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb() there can be no importance to this interaction. So this patch removes the locking from nfsd_break_one_deleg() and moves the final test on ->fi_had_conflict out of the locked region to make it clear that locking isn't important to the test. It is still tested after vfs_setlease() has succeeded. This might be significant and as vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called under ->flc_lock this "after" is a true ordering provided by a spinlock. Fixes: edcf9725150e ("nfsd: fix RELEASE_LOCKOWNER") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# edcf9725	21-Jan-2024	NeilBrown <neilb@suse.de>	nfsd: fix RELEASE_LOCKOWNER The test on so_count in nfsd4_release_lockowner() is nonsense and harmful. Revert to using check_for_locks(), changing that to not sleep. First: harmful. As is documented in the kdoc comment for nfsd4_release_lockowner(), the test on so_count can transiently return a false positive resulting in a return of NFS4ERR_LOCKS_HELD when in fact no locks are held. This is clearly a protocol violation and with the Linux NFS client it can cause incorrect behaviour. If RELEASE_LOCKOWNER is sent while some other thread is still processing a LOCK request which failed because, at the time that request was received, the given owner held a conflicting lock, then the nfsd thread processing that LOCK request can hold a reference (conflock) to the lock owner that causes nfsd4_release_lockowner() to return an incorrect error. The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so it knows that the error is impossible. It assumes the lock owner was in fact released so it feels free to use the same lock owner identifier in some later locking request. When it does reuse a lock owner identifier for which a previous RELEASE failed, it will naturally use a lock_seqid of zero. However the server, which didn't release the lock owner, will expect a larger lock_seqid and so will respond with NFS4ERR_BAD_SEQID. So clearly it is harmful to allow a false positive, which testing so_count allows. The test is nonsense because ... well... it doesn't mean anything. so_count is the sum of three different counts. 1/ the set of states listed on so_stateids 2/ the set of active vfs locks owned by any of those states 3/ various transient counts such as for conflicting locks. When it is tested against '2' it is clear that one of these is the transient reference obtained by find_lockowner_str_locked(). It is not clear what the other one is expected to be. In practice, the count is often 2 because there is precisely one state on so_stateids. If there were more, this would fail. In my testing I see two circumstances when RELEASE_LOCKOWNER is called. In one case, CLOSE is called before RELEASE_LOCKOWNER. That results in all the lock states being removed, and so the lockowner being discarded (it is removed when there are no more references which usually happens when the lock state is discarded). When nfsd4_release_lockowner() finds that the lock owner doesn't exist, it returns success. The other case shows an so_count of '2' and precisely one state listed in so_stateid. It appears that the Linux client uses a separate lock owner for each file resulting in one lock state per lock owner, so this test on '2' is safe. For another client it might not be safe. So this patch changes check_for_locks() to use the (newish) find_any_file_locked() so that it doesn't take a reference on the nfs4_file and so never calls nfsd_file_put(), and so never sleeps. With this check is it safe to restore the use of check_for_locks() rather than testing so_count against the mysterious '2'. Fixes: ce3c4ad7f4ce ("NFSD: Fix possible sleep during nfsd4_release_lockowner()") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org # v6.2+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
# 3c86e615	04-Dec-2023	Dan Carpenter <dan.carpenter@linaro.org>	nfsd: remove unnecessary NULL check We check "state" for NULL on the previous line so it can't be NULL here. No need to check again. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202312031425.LffZTarR-lkp@intel.com/ Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>