History log of /linux-master/fs/xfs/xfs_trans.h
Revision Date Author Comments
# f2e812c1 18-Mar-2024 Dave Chinner <dchinner@redhat.com>

xfs: don't use current->journal_info

syzbot reported an ext4 panic during a page fault where found a
journal handle when it didn't expect to find one. The structure
it tripped over had a value of 'TRAN' in the first entry in the
structure, and that indicates it tripped over a struct xfs_trans
instead of a jbd2 handle.

The reason for this is that the page fault was taken during a
copy-out to a user buffer from an xfs bulkstat operation. XFS uses
an "empty" transaction context for bulkstat to do automated metadata
buffer cleanup, and so the transaction context is valid across the
copyout of the bulkstat info into the user buffer.

We are using empty transaction contexts like this in XFS to reduce
the risk of failing to release objects we reference during the
operation, especially during error handling. Hence we really need to
ensure that we can take page faults from these contexts without
leaving landmines for the code processing the page fault to trip
over.

However, this same behaviour could happen from any other filesystem
that triggers a page fault or any other exception that is handled
on-stack from within a task context that has current->journal_info
set. Having a page fault from some other filesystem bounce into XFS
where we have to run a transaction isn't a bug at all, but the usage
of current->journal_info means that this could result corruption of
the outer task's journal_info structure.

The problem is purely that we now have two different contexts that
now think they own current->journal_info. IOWs, no filesystem can
allow page faults or on-stack exceptions while current->journal_info
is set by the filesystem because the exception processing might use
current->journal_info itself.

If we end up with nested XFS transactions whilst holding an empty
transaction, then it isn't an issue as the outer transaction does
not hold a log reservation. If we ignore the current->journal_info
usage, then the only problem that might occur is a deadlock if the
exception tries to take the same locks the upper context holds.
That, however, is not a problem that setting current->journal_info
would solve, so it's largely an irrelevant concern here.

IOWs, we really only use current->journal_info for a warning check
in xfs_vm_writepages() to ensure we aren't doing writeback from a
transaction context. Writeback might need to do allocation, so it
can need to run transactions itself. Hence it's a debug check to
warn us that we've done something silly, and largely it is not all
that useful.

So let's just remove all the use of current->journal_info in XFS and
get rid of all the potential issues from nested contexts where
current->journal_info might get misused by another filesystem
context.

Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>


# 0dc63c8a 22-Feb-2024 Darrick J. Wong <djwong@kernel.org>

xfs: launder in-memory btree buffers before transaction commit

As we've noted in various places, all current users of in-memory btrees
are online fsck. Online fsck only stages a btree long enough to rebuild
an ondisk data structure, which means that the in-memory btree is
ephemeral. Furthermore, if we encounter /any/ errors while updating an
in-memory btree, all we do is tear down all the staged data and return
an errno to userspace. In-memory btrees need not be transactional, so
their buffers should not be committed to the ondisk log, nor should they
be checkpointed by the AIL. That's just as well since the ephemeral
nature of the btree means that the buftarg and the buffers may disappear
quickly anyway.

Therefore, we need a way to launder the btree buffers that get attached
to the transaction by the generic btree code. Because the buffers are
directly mapped to backing file pages, there's no need to bwrite them
back to the tmpfs file. All we need to do is clean enough of the buffer
log item state so that the bli can be detached from the buffer, remove
the bli from the transaction's log item list, and reset the transaction
dirty state as if the laundered items had never been there.

For simplicity, create xfbtree transaction commit and cancel helpers
that launder the in-memory btree buffers for callers. Once laundered,
call the write verifier on non-stale buffers to avoid integrity issues,
or punch a hole in the backing file for stale buffers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 8f71bede 15-Dec-2023 Darrick J. Wong <djwong@kernel.org>

xfs: repair inode fork block mapping data structures

Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# a49c708f 30-Nov-2023 Darrick J. Wong <djwong@kernel.org>

xfs: move ->iop_relog to struct xfs_defer_op_type

The only log items that need relogging are the ones created for deferred
work operations, and the only part of the code base that relogs log
items is the deferred work machinery. Move the function pointers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# bd3a88f6 30-Nov-2023 Darrick J. Wong <djwong@kernel.org>

xfs: use xfs_defer_create_done for the relogging operation

Now that we have a helper to handle creating a log intent done item and
updating all the necessary state flags, use it to reduce boilerplate in
the ->iop_relog implementations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# db7ccc0b 22-Nov-2023 Darrick J. Wong <djwong@kernel.org>

xfs: move ->iop_recover to xfs_defer_op_type

Finish off the series by moving the intent item recovery function
pointer to the xfs_defer_op_type struct, since this is really a deferred
work function now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# a050acdf 22-Nov-2023 Darrick J. Wong <djwong@kernel.org>

xfs: pass the xfs_defer_pending object to iop_recover

Now that log intent item recovery recreates the xfs_defer_pending state,
we should pass that into the ->iop_recover routines so that the intent
item can finish the recreation work.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 692b6cdd 10-Feb-2023 Dave Chinner <dchinner@redhat.com>

xfs: t_firstblock is tracking AGs not blocks

The tp->t_firstblock field is now raelly tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. It's purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# fad743d7 13-Jul-2022 Dave Chinner <dchinner@redhat.com>

xfs: add log item precommit operation

For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.

The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.

As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.

We do this by introducing a precommit log item method. This commit
just introduces the mechanism; the inode item implementation is in
followup commits.

The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 016a2338 07-Jul-2022 Dave Chinner <dchinner@redhat.com>

xfs: Add order IDs to log items in CIL

Before we split the ordered CIL up into per cpu lists, we need a
mechanism to track the order of the items in the CIL. We need to do
this because there are rules around the order in which related items
must physically appear in the log even inside a single checkpoint
transaction.

An example of this is intents - an intent must appear in the log
before it's intent done record so that log recovery can cancel the
intent correctly. If we have these two records misordered in the
CIL, then they will not be recovered correctly by journal replay.

We also will not be able to move items to the tail of
the CIL list when they are relogged, hence the log items will need
some mechanism to allow the correct log item order to be recreated
before we write log items to the hournal.

Hence we need to have a mechanism for recording global order of
transactions in the log items so that we can recover that order
from un-ordered per-cpu lists.

Do this with a simple monotonic increasing commit counter in the CIL
context. Each log item in the transaction gets stamped with the
current commit order ID before it is added to the CIL. If the item
is already in the CIL, leave it where it is instead of moving it to
the tail of the list and instead sort the list before we start the
push work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 0d227466 03-May-2022 Dave Chinner <dchinner@redhat.com>

xfs: intent item whiteouts

When we log modifications based on intents, we add both intent
and intent done items to the modification being made. These get
written to the log to ensure that the operation is re-run if the
intent done is not found in the log.

However, for operations that complete wholly within a single
checkpoint, the change in the checkpoint is atomic and will never
need replay. In this case, we don't need to actually write the
intent and intent done items to the journal because log recovery
will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by
inserting the intent into the AIL, then removing it when a matching
intent done item is found. Hence for all the intent-based operations
that complete within a checkpoint, we spend all that time parsing
the intent/intent done items just to cancel them and do nothing with
them.

Hence it follows that the only time we actually need intents in the
log is when the modification crosses checkpoint boundaries in the
log and so may only be partially complete in the journal. Hence if
we commit and intent done item to the CIL and the intent item is in
the same checkpoint, we don't actually have to write them to the
journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents
unnecessarily like this because the intents we log are generally
very much smaller than the change being made. e.g. freeing an extent
involves modifying at lease two freespace btree blocks and the AGF,
so the EFI/EFD overhead is only a small increase in space and
processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically,
especially for inline attributes. In the case of adding an inline
attribute, we only log the inode core and attribute fork at present.
With delayed attributes, we now log the attr intent which includes
the name and value, the inode core adn attr fork, and finally the
attr intent done item. We increase the number of items we log from 1
to 3, and the number of log vectors (regions) goes up from 3 to 7.
Hence we tripple the number of objects that the CIL has to process,
and more than double the number of log vectors that need to be
written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to
become CPU bound processing all the extra items, resulting in a > 40%
performance degradation on 16-way file+xattr create worklaods.
Pipelining the CIL (as per 5.15) reduces the performance degradation
to 20%, but now the limitation is the rate at which the log items
can be written to the iclogs and iclogs be dispatched for IO and
completed.

Even log IO completion is slowed down by these intents, because it
now has to process 3x the number of items in the checkpoint.
Processing completed intents is especially inefficient here, because
we first insert the intent into the AIL, then remove it from the AIL
when the intent done is processed. IOWs, we are also doing expensive
operations in log IO completion we could completely avoid if we
didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the
associated intent is in the same checkpoint as we are currently
committing the intent done to. If so, we can mark the intent log
item with a whiteout and immediately free the intent done item
rather than committing it to the CIL. We can basically skip the
entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point
because the unlocked per-cpu CIL item lists do not permit removal
without holding the CIL context lock exclusively. Transaction commit
only holds the context lock shared, hence the best we can do is mark
the intent item with a whiteout so that the CIL push can release it
rather than writing it to the log.

This means we never write the intent to the log if the intent done
has also been committed to the same checkpoint, but we'll always
write the intent if the intent done has not been committed or has
been committed to a different checkpoint. This will result in
correct log recovery behaviour in all cases, without the overhead of
logging unnecessary intents.

This intent whiteout concept is generic - we can apply it to all
intent/intent done pairs that have a direct 1:1 relationship. The
way deferred ops iterate and relog intents mean that all intents
currently have a 1:1 relationship with their done intent, and hence
we can apply this cancellation to all existing intent/intent done
implementations.

For delayed attributes with a 16-way 64kB xattr create workload,
whiteouts reduce the amount of journalled metadata from ~2.5GB/s
down to ~600MB/s and improve the creation rate from 9000/s to
14000/s.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# c23ab603 03-May-2022 Dave Chinner <dchinner@redhat.com>

xfs: add log item method to return related intents

To apply a whiteout to an intent item when an intent done item is
committed, we need to be able to retrieve the intent item from the
the intent done item. Add a log item op method for doing this, and
wire all the intent done items up to it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# f5b81200 03-May-2022 Dave Chinner <dchinner@redhat.com>

xfs: add log item flags to indicate intents

We currently have a couple of helper functions that try to infer
whether the log item is an intent or intent done item from the
combinations of operations it supports. This is incredibly fragile
and not very efficient as it requires checking specific combinations
of ops.

We need to be able to identify intent and intent done items quickly
and easily in upcoming patches, so simply add intent and intent done
type flags to the log item ops flags. These are static flags to
begin with, so intent items should have been typed like this from
the start.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 22d53f48 20-Apr-2022 Dave Chinner <dchinner@redhat.com>

xfs: convert log item tracepoint flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# b9b3fe15 20-Apr-2022 Dave Chinner <david@fromorbit.com>

xfs: convert buffer flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned. This manifests as a compiler error such as:

/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
{ XBF_UNMAPPED, "UNMAPPED" }
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
#define XBF_UNMAPPED (1 << 31)/* do not map the buffer */

as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
unsigned long for the flag. Since this results in the value of
XBF_UNMAPPED causing a signed integer overflow, the result is
technically undefined behavior, which gcc-5 does not accept as an
integer constant.

This is based on a patch from Arnd Bergman <arnd@arndb.de>.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# d86142dd 17-Mar-2022 Dave Chinner <dchinner@redhat.com>

xfs: log items should have a xlog pointer, not a mount

Log items belong to the log, not the xfs_mount. Convert the mount
pointer in the log item to a xlog pointer in preparation for
upcoming log centric changes to the log items.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 871b9316 25-Feb-2022 Darrick J. Wong <djwong@kernel.org>

xfs: reserve quota for dir expansion when linking/unlinking files

XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory. This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.

The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.

Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.

To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory. If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode. This should prevent
quota usage overruns with the least restriction in functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 182696fb 12-Oct-2021 Darrick J. Wong <djwong@kernel.org>

xfs: rename _zone variables to _cache

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>


# e7720afa 27-Sep-2021 Darrick J. Wong <djwong@kernel.org>

xfs: remove kmem_zone typedef

Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>


# c5db9f93 16-Sep-2021 Darrick J. Wong <djwong@kernel.org>

xfs: formalize the process of holding onto resources across a defer roll

Transaction users are allowed to flag up to two buffers and two inodes
for ownership preservation across a deferred transaction roll. Hoist
the variables and code responsible for this out of xfs_defer_trans_roll
so that we can use it for the defer capture mechanism.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


# 5f9b4b0d 18-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: xfs_log_force_lsn isn't passed a LSN

In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This lead me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 1aec7c3d 23-Apr-2021 Darrick J. Wong <djwong@kernel.org>

xfs: remove obsolete AGF counter debugging

In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to
remove blocks from an overfilled AGFL because there were complaints
about transaction overruns that stemmed from trying to free multiple
blocks in a single transaction.

Unfortunately, that commit missed a subtlety in the debug-mode
transaction accounting when a realtime volume is attached. If a
realtime file undergoes a data fork mapping change such that realtime
extents are allocated (or freed) in the same transaction that a data
device block is also allocated (or freed), we can trip a debugging
assertion. This can happen (for example) if a realtime extent is
allocated and it is necessary to reshape the bmbt to hold the new
mapping.

When we go to allocate a bmbt block from an AG, the first thing the data
device block allocator does is ensure that the freelist is the proper
length. If the freelist is too long, it will trim the freelist to the
proper length.

In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
record the decrement in the AG free list count. Prior to f8f28 we would
put the free block back in the free space btrees in the same
transaction, which calls xfs_trans_agblocks_delta() to record the
increment in the AG free block count. Since AGFL blocks are included in
the global free block count (fdblocks), there is no corresponding
fdblocks update, so the AGFL free satisfies the following condition in
xfs_trans_apply_sb_deltas:

/*
* Check that superblock mods match the mods made to AGF counters.
*/
ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
(tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
tp->t_ag_btree_delta));

The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
the number blocks that were allocated.

After commit f8f28 we defer the block freeing to the next chained
transaction, which means that the calls to xfs_trans_agflist_delta and
xfs_trans_agblocks_delta occur in separate transactions. The (first)
transaction that shortens the free list trips on the comparison, which
has now become:

(X + 0) == ((X) + -1 + 0)

because we haven't freed the AGFL block yet; we've only logged an
intention to free it. When the second transaction (the deferred free)
commits, it will evaluate the expression as:

(0 + 0) == (1 + 0 + 0)

and trip over that in turn.

At this point, the astute reader may note that the two commits tagged by
this patch have been in the kernel for a long time but haven't generated
any bug reports. How is it that the author became aware of this bug?

This originally surfaced as an intermittent failure when I was testing
realtime rmap, but a different bug report by Zorro Lang reveals the same
assertion occuring on !lazysbcount filesystems.

The common factor to both reports (and why this problem wasn't
previously reported) becomes apparent if we consider when
xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():

if (tp->t_flags & XFS_TRANS_SB_DIRTY)
xfs_trans_apply_sb_deltas(tp);

With a modern lazysbcount filesystem, transactions update only the
percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
xfs_trans_apply_sb_deltas is rarely called.

However, updates to the count of free realtime extents are not part of
lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
removing data fork mappings to realtime files; similarly,
XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.

Dave mentioned in response to an earlier version of this patch:

"IIUC, what you are saying is that this debug code is simply not
exercised in normal testing and hasn't been for the past decade? And it
still won't be exercised on anything other than realtime device testing?

"...it was debugging code from 1994 that was largely turned into dead
code when lazysbcounters were introduced in 2007. Hence I'm not sure it
holds any value anymore."

This debugging code isn't especially helpful - you can modify the
flcount on one AG and the freeblks of another AG, and it won't trigger.
Add the fact that nobody noticed for a decade, and let's just get rid of
it (and start testing realtime :P).

This bug was found by running generic/051 on either a V4 filesystem
lacking lazysbcount; or a V5 filesystem with a realtime volume.

Cc: bfoster@redhat.com, zlang@redhat.com
Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 756b1c34 23-Feb-2021 Dave Chinner <dchinner@redhat.com>

xfs: use current->journal_info for detecting transaction recursion

Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.

[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]

Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 7317a03d 29-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: refactor inode ownership change transaction/inode/quota allocation idiom

For file ownership (uid, gid, prid) changes, create a new helper
xfs_trans_alloc_ichange that allocates a transaction and reserves the
appropriate amount of quota against that transction in preparation for a
change of user, group, or project id. Replace all the open-coded idioms
with a single call to this helper so that we can contain the retry loops
in the next patchset.

This changes the locking behavior for ichange transactions slightly.
Since tr_ichange does not have a permanent reservation and cannot roll,
we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked
automatically at commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# f2f7b9ff 27-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: refactor inode creation transaction/inode/quota allocation idiom

For file creation, create a new helper xfs_trans_alloc_icreate that
allocates a transaction and reserves the appropriate amount of quota
against that transction. Replace all the open-coded idioms with a
single call to this helper so that we can contain the retry loops in the
next patchset.

This changes the locking behavior for non-tempfile creation slightly, in
that we now make the quota reservation without holding the directory
ILOCK. While the dquots chosen for inode creation are based on the
directory state at a given point in time, the directory ILOCK was
released as soon as the dquot references are picked up. Hence it was
never necessary to hold the directory ILOCK for the quota reservation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 3de4eb10 26-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: allow reservation of rtblocks with xfs_trans_alloc_inode

Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 3a1af6c3 26-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: refactor common transaction/inode/quota allocation idiom

Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transction. Then replace all the open-coded
idioms with a single call to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 4e919af7 27-Sep-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: periodically relog deferred intent items

There's a subtle design flaw in the deferred log item code that can lead
to pinning the log tail. Taking up the defer ops chain examples from
the previous commit, we can get trapped in sequences like this:

Caller hands us a transaction t0 with D0-D3 attached. The defer ops
chain will look like the following if the transaction rolls succeed:

t1: D0(t0), D1(t0), D2(t0), D3(t0)
t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

In transaction 9, we finish d9 and try to roll to t10 while holding onto
an intent item for D3 that we logged in t0.

The previous commit changed the order in which we place new defer ops in
the defer ops processing chain to reduce the maximum chain length. Now
make xfs_defer_finish_noroll capable of relogging the entire chain
periodically so that we can always move the log tail forward. Most
chains will never get relogged, except for operations that generate very
long chains (large extents containing many blocks with different sharing
levels) or are on filesystems with small logs and a lot of ongoing
metadata updates.

Callers are now required to ensure that the transaction reservation is
large enough to handle logging done items and new intent items for the
maximum possible chain length. Most callers are careful to keep the
chain lengths low, so the overhead should be minimal.

The decision to relog an intent item is made based on whether the intent
was logged in a previous checkpoint, since there's no point in relogging
an intent into the same checkpoint.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# e6fff81e 25-Sep-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: proper replay of deferred ops queued during log recovery

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items. As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items. Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items. After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent. Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 901219bb 28-Sep-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove XFS_LI_RECOVERED

The ->iop_recover method of a log intent item removes the recovered
intent item from the AIL by logging an intent done item and committing
the transaction, so it's superfluous to have this flag check. Nothing
else uses it, so get rid of the flag entirely.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# d6b8fc6c 23-Sep-2020 Kaixu Xia <kaixuxia@tencent.com>

xfs: do the assert for all the log done items in xfs_trans_cancel

We should do the assert for all the log intent-done items if they appear
here. This patch detect intent-done items by the fact that their item ops
don't have iop_unpin and iop_push methods and also move the helper
xlog_item_is_intent to xfs_trans.h.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# cead0b10 01-Sep-2020 Christoph Hellwig <hch@lst.de>

xfs: simplify xfs_trans_getsb

Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3536b61e 29-Jun-2020 Dave Chinner <dchinner@redhat.com>

xfs: unwind log item error flagging

When an buffer IO error occurs, we want to mark all
the log items attached to the buffer as failed. Open code
the error handling loop so that we can modify the flagging for the
different types of objects directly and independently of each other.

This also allows us to remove the ->iop_error method from the log
item operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 2ef3f7f5 29-Jun-2020 Dave Chinner <dchinner@redhat.com>

xfs: get rid of log item callbacks

They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 889eb55d 01-May-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: refactor intent item RECOVERED flag into the log item

Rename XFS_{EFI,BUI,RUI,CUI}_RECOVERED to XFS_LI_RECOVERED so that we
track recovery status in the log item, then get rid of the now unused
flags fields in each of those log item types.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 154c733a 01-May-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: refactor releasing finished intents during log recovery

Replace the open-coded AIL item walking with a proper helper when we're
trying to release an intent item that has been finished. We add a new
->iop_match method to decide if an intent item matches a supplied ID.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 10d0c6e0 01-May-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: refactor recovered EFI log item playback

Move the code that processes the log items created from the recovered
log items into the per-item source code files and use dispatch functions
to call them. No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# ce92464c 23-Jan-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: make xfs_trans_get_buf return an error code

Convert xfs_trans_get_buf() to return numeric error codes like most
everywhere else in xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 9676b54e 23-Jan-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: make xfs_trans_get_buf_map return an error code

Convert xfs_trans_get_buf_map() to return numeric error codes like most
everywhere else in xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# caeaea98 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_trans_bmap.c into xfs_bmap_item.c

Keep all bmap item related code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3cfce1e3 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_trans_rmap.c into xfs_rmap_item.c

Keep all rmap item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# effd5e96 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_trans_refcount.c into xfs_refcount_item.c

Keep all the refcount item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 81f40041 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_trans_extfree.c into xfs_extfree_item.c

Keep all the extree item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# efe2330f 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: remove the xfs_log_item_t typedef

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9ce632a2 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: add a flag to release log items on commit

We have various items that are released from ->iop_comitting. Add a
flag to just call ->iop_release from the commit path to avoid tons
of boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ddf92053 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: split iop_unlock

The iop_unlock method is called when comitting or cancelling a
transaction. In the latter case, the transaction may or may not be
aborted. While there is no known problem with the current code in
practice, this implementation is limited in that any log item
implementation that might want to differentiate between a commit and a
cancellation must rely on the aborted state. The aborted bit is only
set when the cancelled transaction is dirty, however. This means that
there is no way to distinguish between a commit and a clean transaction
cancellation.

For example, intent log items currently rely on this distinction. The
log item is either transferred to the CIL on commit or released on
transaction cancel. There is currently no possibility for a clean intent
log item in a transaction, but if that state is ever introduced a cancel
of such a transaction will immediately result in memory leaks of the
associated log item(s). This is an interface deficiency and landmine.

To clean this up, replace the iop_unlock method with an iop_release
method that is specific to transaction cancel. The existing
iop_committing method occurs at the same time as iop_unlock in the
commit path and there is no need for two separate callbacks here.
Overload the iop_committing method with the current commit time
iop_unlock implementations to eliminate the need for the latter and
further simplify the interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 8c9ce2f7 12-Jun-2019 Eric Sandeen <sandeen@sandeen.net>

xfs: remove unused flags arg from getsb interfaces

The flags value is always passed as 0 so remove the argument.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 66e3237e 12-Dec-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: const-ify xfs_owner_info arguments

Only certain functions actually change the contents of an
xfs_owner_info; the rest can accept a const struct pointer. This will
enable us to save stack space by hoisting static owner info types to
be const global variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# bc9f2b7c 12-Dec-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: idiotproof defer op type configuration

Recently, we forgot to port a new defer op type to xfsprogs, which
caused us some userspace pain. Reorganize the way we make libxfs
clients supply defer op type information so that all type information
has to be provided at build time instead of risky runtime dynamic
configuration.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 38b6238e 18-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix buffer state management in xrep_findroot_block

We don't handle buffer state properly in online repair's findroot
routine. If a buffer already has b_ops set, we don't ever want to touch
that, and we don't want to call the read verifiers on a buffer that
could be dirty (CRCs are only recomputed during log checkpoints).

Therefore, be more careful about what we do with a buffer -- if someone
else already attached ops that are not the ones for this btree type,
just ignore the buffer. We only attach our btree type's buf ops if it
matches the magic/uuid and structure checks.

We also modify xfs_buf_read_map to allow callers to set buffer ops on a
DONE buffer with NULL ops so that repair doesn't leave behind buffers
which won't have buffers attached to them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9d9e6233 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: fold dfops into the transaction

struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.

Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 1ae093cb 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: replace xfs_defer_ops ->dop_pending with on-stack list

The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).

Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.

Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 7dbddbac 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: drop dop param from xfs_defer_op_type ->finish_item() callback

The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a8198666 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: automatic dfops inode relogging

Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.

Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 82ff27bc 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: automatic dfops buffer relogging

Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.

Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 1214f1cf 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: replace dop_low with transaction flag

The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.

Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 7279aa13 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: remove unused __xfs_defer_cancel() internal helper

With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9e28a242 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: drop unnecessary xfs_defer_finish() dfops parameter

Every caller of xfs_defer_finish() now passes the transaction and
its associated ->t_dfops. The xfs_defer_ops parameter is therefore
no longer necessary and can be removed.

Since most xfs_defer_finish() callers also have to consider
xfs_defer_cancel() on error, update the latter to also receive the
transaction for consistency. The log recovery code contains an
outlier case that cancels a dfops directly without an available
transaction. Retain an internal wrapper to support this outlier case
for the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e021a2e5 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: support embedded dfops in transaction

The dfops structure used by multi-transaction operations is
typically stored on the stack and carried around by the associated
transaction. The lifecycle of dfops does not quite match that of the
transaction, but they are tightly related in that the former depends
on the latter.

The relationship of these objects is tight enough that we can avoid
the cumbersome boilerplate code required in most cases to manage
them separately by just embedding an xfs_defer_ops in the
transaction itself. This means that a transaction allocation returns
with an initialized dfops, a transaction commit finishes pending
deferred items before the tx commit, a transaction cancel cancels
the dfops before the transaction and a transaction dup operation
transfers the current dfops state to the new transaction.

The dup operation is slightly complicated by the fact that we can no
longer just copy a dfops pointer from the old transaction to the new
transaction. This is solved through a dfops move helper that
transfers the pending items and other dfops state across the
transactions. This also requires that transaction rolling code
always refer to the transaction for the current dfops reference.

Finally, to facilitate incremental conversion to the internal dfops
and continue to support the current external dfops mode of
operation, create the new ->t_dfops_internal field with a layer of
indirection. On allocation, ->t_dfops points to the internal dfops.
This state is overridden by callers who re-init a local dfops on the
transaction. Once ->t_dfops is overridden, the external dfops
reference is maintained as the transaction rolls.

This patch adds the fundamental ability to support an internal
dfops. All codepaths that perform deferred processing continue to
override the internal dfops until they are converted over in
subsequent patches.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 44fd2946 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: pack holes in xfs_defer_ops and xfs_trans

Both structures have holes due to member alignment. Move dop_low to
the end of xfs_defer ops to sanitize the cache line alignment and
move t_flags to save 8 bytes in xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# bba59c5e 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: add firstblock field to xfs_trans

A firstblock var is typically allocated and initialized along with
xfs_defer_ops structures and passed around independent from the
associated transaction. To facilitate combining the two, add an
optional ->t_firstblock field to xfs_trans that can be used in place
of an on-stack variable.

The firstblock value follows the lifetime of the transaction, so
initialize it on allocation and when a transaction rolls.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 6aa67184 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops

The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.

Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0b61f8a4 05-Jun-2018 Dave Chinner <dchinner@redhat.com>

xfs: convert to SPDX license tags

Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
echo $f
cat $f | awk -f hdr.awk > $f.new
mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
hdr = 1.0
tag = "GPL-2.0"
str = ""
}

/^ \* This program is free software/ {
hdr = 2.0;
next
}

/any later version./ {
tag = "GPL-2.0+"
next
}

/^ \*\// {
if (hdr > 0.0) {
print "// SPDX-License-Identifier: " tag
print str
print $0
str=""
hdr = 0.0
next
}
print $0
next
}

/^ \* / {
if (hdr > 1.0)
next
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
next
}

/^ \*/ {
if (hdr > 0.0)
next
print $0
next
}

// {
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# fcb762f5 09-May-2018 Brian Foster <bfoster@redhat.com>

xfs: add bmapi nodiscard flag

Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.

This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e6631f85 09-May-2018 Dave Chinner <dchinner@redhat.com>

xfs: get rid of the log item descriptor

It's just a connector between a transaction and a log item. There's
a 1:1 relationship between a log item descriptor and a log item,
and a 1:1 relationship between a log item descriptor and a
transaction. Both relationships are created and terminated at the
same time, so why do we even have the descriptor?

Replace it with a specific list_head in the log item and a new
log item dirtied flag to replace the XFS_LID_DIRTY flag.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix up deferred agfl intent finish_item use of LID_DIRTY]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 22525c17 09-May-2018 Dave Chinner <dchinner@redhat.com>

xfs: log item flags are racy

The log item flags contain a field that is protected by the AIL
lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to
set and clear these flags, but most of the updates and checks are
not done with the AIL lock held and so are susceptible to update
races.

Fix this by changing the log item flags to use atomic bitops rather
than be reliant on the AIL lock for update serialisation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f8f2835a 07-May-2018 Brian Foster <bfoster@redhat.com>

xfs: defer agfl block frees when dfops is available

The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.

Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.

Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.

Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.

A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.

Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 643c8c05 24-Jan-2018 Carlos Maiolino <cmaiolino@redhat.com>

Use list_head infra-structure for buffer's log items list

Now that buffer's b_fspriv has been split, just replace the current
singly linked list of xfs_log_items, by the list_head infrastructure.

Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers(),
there is no need for this argument, once the log items can be walked
through the list_head in the buffer.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style cleanups]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a5814bce 29-Aug-2017 Brian Foster <bfoster@redhat.com>

xfs: disallow marking previously dirty buffers as ordered

Ordered buffers are used in situations where the buffer is not
physically logged but must pass through the transaction/logging
pipeline for a particular transaction. As a result, ordered buffers
are not unpinned and written back until the transaction commits to
the log. Ordered buffers have a strict requirement that the target
buffer must not be currently dirty and resident in the log pipeline
at the time it is marked ordered. If a dirty+ordered buffer is
committed, the buffer is reinserted to the AIL but not physically
relogged at the LSN of the associated checkpoint. The buffer log
item is assigned the LSN of the latest checkpoint and the AIL
effectively releases the previously logged buffer content from the
active log before the buffer has been written back. If the tail
pushes forward and a filesystem crash occurs while in this state, an
inconsistent filesystem could result.

It is currently the caller responsibility to ensure an ordered
buffer is not already dirty from a previous modification. This is
unclear and error prone when not used in situations where it is
guaranteed a buffer has not been previously modified (such as new
metadata allocations).

To facilitate general purpose use of ordered buffers, update
xfs_trans_ordered_buf() to conditionally order the buffer based on
state of the log item and return the status of the result. If the
bli is dirty, do not order the buffer and return false. The caller
must either physically log the buffer (having acquired the
appropriate log reservation) or push it from the AIL to clean it
before it can be marked ordered in the current transaction.

Note that ordered buffers are currently only used in two situations:
1.) inode chunk allocation where previously logged buffers are not
possible and 2.) extent swap which will be updated to handle ordered
buffer failures in a separate patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9684010d 29-Aug-2017 Brian Foster <bfoster@redhat.com>

xfs: refactor buffer logging into buffer dirtying helper

xfs_trans_log_buf() is responsible for logging the dirty segments of
a buffer along with setting all of the necessary state on the
transaction, buffer, bli, etc., to ensure that the associated items
are marked as dirty and prepared for I/O. We have a couple use cases
that need to to dirty a buffer in a transaction without actually
logging dirty ranges of the buffer. One existing use case is
ordered buffers, which are currently logged with arbitrary ranges to
accomplish this even though the content of ordered buffers is never
written to the log. Another pending use case is to relog an already
dirty buffer across rolled transactions within the deferred
operations infrastructure. This is required to prevent a held
(XFS_BLI_HOLD) buffer from pinning the tail of the log.

Refactor xfs_trans_log_buf() into a new function that contains all
of the logic responsible to dirty the transaction, lidp, buffer and
bli. This new function can be used in the future for the use cases
outlined above. This patch does not introduce functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 411350df 28-Aug-2017 Christoph Hellwig <hch@lst.de>

xfs: refactor xfs_trans_roll

Split xfs_trans_roll into a low-level helper that just rolls the
actual transaction and a new higher level xfs_trans_roll_inode
that takes care of logging and rejoining the inode. This gets
rid of the NULL inode case, and allows to simplify the special
cases in the deferred operation code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# d3a304b6 08-Aug-2017 Carlos Maiolino <cmaiolino@redhat.com>

xfs: Properly retry failed inode items in case of error during buffer writeback

When a buffer has been failed during writeback, the inode items into it
are kept flush locked, and are never resubmitted due the flush lock, so,
if any buffer fails to be written, the items in AIL are never written to
disk and never unlocked.

This causes unmount operation to hang due these items flush locked in AIL,
but this also causes the items in AIL to never be written back, even when
the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device, and keep spinning on xfsaild (when 'retry
forever' configuration is set).

At this point, the filesystem can not be unmounted because of the flush locked
items in AIL, but worse, the items in AIL are never retried at all
(once xfs_inode_item_push() will skip the items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that has been failed
previously, using the infra-structure provided by the previous patch.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0b80ae6e 08-Aug-2017 Carlos Maiolino <cmaiolino@redhat.com>

xfs: Add infrastructure needed for error propagation during buffer IO failure

With the current code, XFS never re-submit a failed buffer for IO,
because the failed item in the buffer is kept in the flush locked state
forever.

To be able to resubmit an log item for IO, we need a way to mark an item
as failed, if, for any reason the buffer which the item belonged to
failed during writeback.

Add a new log item callback to be used after an IO completion failure
and make the needed clean ups.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c8ce540d 16-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove double-underscore integer types

This is a purely mechanical patch that removes the private
__{u,}int{8,16,32,64}_t typedefs in favor of using the system
{u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
the transformation and fix the resulting whitespace and indentation
errors:

s/typedef\t__uint8_t/typedef __uint8_t\t/g
s/typedef\t__uint/typedef __uint/g
s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
s/__uint8_t\t/__uint8_t\t\t/g
s/__uint/uint/g
s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
s/__int/int/g
/^typedef.*int[0-9]*_t;$/d

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f990fc5a 14-Jun-2017 Shan Hai <shan.hai@oracle.com>

xfs: remove lsn relevant fields from xfs_trans structure and its users

The t_lsn is not used anymore and the t_commit_lsn is used as a tmp
storage for the checkpoint sequence number only in the current code.

And the start/commit lsn are tracked as a transaction group tag in
the xfs_cil_ctx instead of a single transaction, so remove them from
the xfs_trans structure and their users to match with the design.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e1a4e37c 14-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent

In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 254133f5 06-Apr-2017 Christoph Hellwig <hch@lst.de>

xfs: fold __xfs_trans_roll into xfs_trans_roll

No one cares about the low-level helper anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e89c0413 28-Mar-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: implement the GETFSMAP ioctl

Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 64f61ab6 28-Jan-2017 Eric Sandeen <sandeen@redhat.com>

xfs: remove unused struct declarations

After scratching my head looking for "xfs_busy_extent" I realized
it's not used; it's xfs_extent_busy, and the declaration for the
other name is bogus. Remove that and a few others as well.

(struct xfs_log_callback is used, but the 2nd declaration is
unnecessary).

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9f3afb57 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: implement deferred bmbt map/unmap operations

Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 77d61fe4 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: log bmap intent items

Provide a mechanism for higher levels to create BUI/BUD items, submit
them to the log, and a stub function to deal with recovered BUI items.
These parts will be connected to the rmapbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 33ba6129 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: connect refcount adjust functions to upper layers

Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f997ee21 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: log refcount intent items

Provide a mechanism for higher levels to create CUI/CUD items, submit
them to the log, and a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 722e2517 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove the extents array from the rmap update done log item

Nothing ever uses the extent array in the rmap update done redo
item, so remove it before it is fixed in the on-disk log format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9c194644 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: propagate bmap updates to rmapbt

When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a respective update in the rmapbt. Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true as we now have ability to make interval
queries against the rmapbt.

We use the deferred operations code to handle redo operations
atomically and deadlock free. This plumbs in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first three
now for file data, and reflink will want the last two. We also add
an error injection site to test log recovery.

Finally, we need to fix the bmap shift extent code to adjust the
rmaps correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# f8dbebef 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: enable the xfs_defer mechanism to process rmaps to update

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred rmap updates. We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9e88b5d8 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: log rmap intent items

Provide a mechanism for higher levels to create RUI/RUD items, submit
them to the log, and a stub function to deal with recovered RUI items.
These parts will be connected to the rmapbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 340785cc 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: add owner field to extent allocation and freeing

For the rmap btree to work, we have to feed the extent owner
information to the the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner. The index field will be used to support
reverse mappings when reflink is enabled.

When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update. Make
sure that the log replay code signals this correctly.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 310a75a3 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*

Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code. No new code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9749fee8 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: enable the xfs_defer mechanism to process extents to free

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred extent freeing. We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# bba61cbf 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: clean up typedef usage in the EFI/EFD handling code

Replace structure typedefs with struct xfs_foo_* in the EFI/EFD
handling code in preparation to move it over to deferred ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# b1c5ebb2 21-Jul-2016 Dave Chinner <dchinner@redhat.com>

xfs: allocate log vector buffers outside CIL context lock

One of the problems we currently have with delayed logging is that
under serious memory pressure we can deadlock memory reclaim. THis
occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
inodes and issues a log force to unpin inodes that are dirty in the
CIL.

The CIL is pushed, but this will only occur once it gets the CIL
context lock to ensure that all committing transactions are complete
and no new transactions start being committed to the CIL while the
push switches to a new context.

The deadlock occurs when the CIL context lock is held by a
committing process that is doing memory allocation for log vector
buffers, and that allocation is then blocked on memory reclaim
making progress. Memory reclaim, however, is blocked waiting for
a log force to make progress, and so we effectively deadlock at this
point.

To solve this problem, we have to move the CIL log vector buffer
allocation outside of the context lock so that memory reclaim can
always make progress when it needs to force the log. The problem
with doing this is that a CIL push can take place while we are
determining if we need to allocate a new log vector buffer for
an item and hence the current log vector may go away without
warning. That means we canot rely on the existing log vector being
present when we finally grab the context lock and so we must have a
replacement buffer ready to go at all times.

To ensure this, introduce a "shadow log vector" buffer that is
always guaranteed to be present when we gain the CIL context lock
and format the item. This shadow buffer may or may not be used
during the formatting, but if the log item does not have an existing
log vector buffer or that buffer is too small for the new
modifications, we swap it for the new shadow buffer and format
the modifications into that new log vector buffer.

The result of this is that for any object we modify more than once
in a given CIL checkpoint, we double the memory required
to track dirty regions in the log. For single modifications then
we consume the shadow log vectorwe allocate on commit, and that gets
consumed by the checkpoint. However, if we make multiple
modifications, then the second transaction commit will allocate a
shadow log vector and hence we will end up with double the memory
usage as only one of the log vectors is consumed by the CIL
checkpoint. The remaining shadow vector will be freed when th elog
item is freed.

This can probably be optimised in future - access to the shadow log
vector is serialised by the object lock (as opposited to the active
log vector, which is controlled by the CIL context lock) and so we
can probably free shadow log vector from some objects when the log
item is marked clean on removal from the AIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 253f4911 05-Apr-2016 Christoph Hellwig <hch@lst.de>

xfs: better xfs_trans_alloc interface

Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superflous since we stopped supporting the non-CIL logging mode. The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a7e5d03b 01-Mar-2016 Christoph Hellwig <hch@lst.de>

xfs: remove xfs_trans_get_block_res

Just use the t_blk_res field directly instead of obsfucating the reference
by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 6bc43af3 18-Aug-2015 Brian Foster <bfoster@redhat.com>

xfs: ensure EFD trans aborts on log recovery extent free failure

Log recovery attempts to free extents with leftover EFIs in the AIL
after initial processing. If the extent free fails (e.g., due to
unrelated fs corruption), the transaction is cancelled, though it
might not be dirtied at the time. If this is the case, the EFD does
not abort and thus does not release the EFI. This can lead to hangs
as the EFI pins the AIL.

Update xlog_recover_process_efi() to log the EFD in the transaction
before xfs_free_extent() errors are handled to ensure the
transaction is dirty, aborts the EFD and releases the EFI on error.
Since this is a requirement for EFD processing (and consistent with
xfs_bmap_finish()), update the EFD logging helper to do the extent
free and unconditionally log the EFD. This encodes the required EFD
logging behavior into the helper and reduces the likelihood of
errors down the road.

[dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
failure.]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# d43ac29b 18-Aug-2015 Brian Foster <bfoster@redhat.com>

xfs: return committed status from xfs_trans_roll()

Some callers need to make error handling decisions based on whether
the current transaction successfully committed or not. Rename
xfs_trans_roll(), add a new parameter and provide a wrapper to
preserve existing callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 5e4b5386 18-Aug-2015 Brian Foster <bfoster@redhat.com>

xfs: disentagle EFI release from the extent count

Release of the EFI either occurs based on the reference count or the
extent count. The extent count used is either the count tracked in
the EFI or EFD, depending on the particular situation. In either
case, the count is initialized to the final value and thus always
matches the current efi_next_extent value once the EFI is completely
constructed. For example, the EFI extent count is increased as the
extents are logged in xfs_bmap_finish() and the full free list is
always completely processed. Therefore, the count is guaranteed to
be complete once the EFI transaction is committed. The EFD uses the
efd_nextents counter to release the EFI. This counter is initialized
to the count of the EFI when the EFD is created. Thus the EFD, as
currently used, has no concept of partial EFI release based on
extent count.

Given that the EFI extent count is always released in whole, use of
the extent count for reference counting is unnecessary. Remove this
level of the API and release the EFI based on the core reference
count. The efi_next_extent counter remains because it is still used
to track the slot to log the next extent to free.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 70393313 03-Jun-2015 Christoph Hellwig <hch@lst.de>

xfs: saner xfs_trans_commit interface

The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 4906e215 03-Jun-2015 Christoph Hellwig <hch@lst.de>

xfs: remove the flags argument to xfs_trans_cancel

xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT. Both of them are a direct product of the transaction
state, and can be deducted:

- any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
- any transaction with a permanent log reservation needs
XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
log reservation is invalid.

So just remove the flags argument and do the right thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 2e6db6c4 03-Jun-2015 Christoph Hellwig <hch@lst.de>

xfs: switch remaining xfs_trans_dup users to xfs_trans_roll

We have three remaining callers of xfs_trans_dup:

- xfs_itruncate_extents which open codes xfs_trans_roll
- xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves
attaching them to it's callers, but otherwise is identical to
xfs_trans_roll
- xfs_dir_ialloc looks at the log reservations in the old xfs_trans
structure instead of the log reservation parameters, but otherwise
is identical to xfs_trans_roll.

By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch
these three remaining users over to xfs_trans_roll and mark xfs_trans_dup
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# bde7cff6 12-Dec-2013 Christoph Hellwig <hch@infradead.org>

xfs: format log items write directly into the linear CIL buffer

Instead of setting up pointers to memory locations in iop_format which then
get copied into the CIL linear buffer after return move the copy into
the individual inode items. This avoids the need to always have a memory
block in the exact same layout that gets written into the log around, and
allow the log items to be much more flexible in their in-memory layouts.

The only caveat is that we need to properly align the data for each
iovec so that don't have structures misaligned in subsequent iovecs.

Note that all log item format routines now need to be careful to modify
the copy of the item that was placed into the CIL after calls to
xlog_copy_iovec instead of the in-memory copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a4fbe6ab 22-Oct-2013 Dave Chinner <dchinner@redhat.com>

xfs: decouple inode and bmap btree header files

Currently the xfs_inode.h header has a dependency on the definition
of the BMAP btree records as the inode fork includes an array of
xfs_bmbt_rec_host_t objects in it's definition.

Move all the btree format definitions from xfs_btree.h,
xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
xfs_format.h to continue the process of centralising the on-disk
format definitions. With this done, the xfs inode definitions are no
longer dependent on btree header files.

The enables a massive culling of unnecessary includes, with close to
200 #include directives removed from the XFS kernel code base.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 239880ef 22-Oct-2013 Dave Chinner <dchinner@redhat.com>

xfs: decouple log and transaction headers

xfs_trans.h has a dependency on xfs_log.h for a couple of
structures. Most code that does transactions doesn't need to know
anything about the log, but this dependency means that they have to
include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
files and clean up the includes to be in dependency order.

In doing this, remove the direct include of xfs_trans_reserve.h from
xfs_trans.h so that we remove the dependency between xfs_trans.h and
xfs_mount.h. Hence the xfs_trans.h include can be moved to the
indicate the actual dependencies other header files have on it.

Note that these are kernel only header files, so this does not
translate to any userspace changes at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d420e5c8 14-Oct-2013 Dave Chinner <dchinner@redhat.com>

xfs: remove unused transaction callback variables

We don't do callbacks at transaction commit time, no do we have any
infrastructure to set up or run such callbacks, so remove the
variables and typedefs for these operations. If we ever need to add
callbacks, we can reintroduce the variables at that time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 904c17e6 28-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: finish removing IOP_* macros.

In optimising the CIL operations, some of the IOP_* macros for
calling log item operations were removed. Remove the rest of them as
Christoph requested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# f5baac35 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: avoid CIL allocation during insert

Now that we have the size of the log vector that has been allocated,
we can determine if we need to allocate a new log vector for
formatting and insertion. We only need to allocate a new vector if
it won't fit into the existing buffer.

However, we need to hold the CIL context lock while we do this so
that we can't race with a push draining the currently queued log
vectors. It is safe to do this as long as we do GFP_NOFS allocation
to avoid avoid memory allocation recursing into the filesystem.
Hence we can safely overwrite the existing log vector on the CIL if
it is large enough to hold all the dirty regions of the current
item.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 7492c5b4 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: Reduce allocations during CIL insertion

Now that we have the size of the object before the formatting pass
is called, we can allocation the log vector and it's buffer in a
single allocation rather than two separate allocations.

Store the size of the allocated buffer in the log vector so that
we potentially avoid allocation for future modifications of the
object.

While touching this code, remove the IOP_FORMAT definition.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 166d1368 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: return log item size in IOP_SIZE

To begin optimising the CIL commit process, we need to have IOP_SIZE
return both the number of vectors and the size of the data pointed
to by the vectors. This enables us to calculate the size ofthe
memory allocation needed before the formatting step and reduces the
number of memory allocations per item by one.

While there, kill the IOP_SIZE macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 3d3c8b52 12-Aug-2013 Jie Liu <jeff.liu@oracle.com>

xfs: refactor xfs_trans_reserve() interface

With the new xfs_trans_res structure has been introduced, the log
reservation size, log count as well as log flags are pre-initialized
at mount time. So it's time to refine xfs_trans_reserve() interface
to be more neat.

Also, introduce a new helper M_RES() to return a pointer to the
mp->m_resv structure to simplify the input.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 7fd36c44 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: split out transaction reservation code

The transaction reservation size calculations is used by both kernel
and userspace, but most of the transaction code in xfs_trans.c is
kernel specific. Split all the transaction reservation code out into
it's own files to make sharing with userspace simpler. This just
leaves kernel-only definitions in xfs_trans.h, so it doesn't need to
be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d386b32b 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: sync minor header differences needed by userspace.

Little things like exported functions, __KERNEL__ protections, and
so on that ensure user and kernel shared headers are identical.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 2a3c0acc 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: split out on-disk transaction definitions

There's a bunch of definitions in xfs_trans.h that define on-disk
formats - transaction headers that get written into the log, log
item type definitions, etc. Split out everything into a separate
file so that all which remains in xfs_trans.h are kernel only
definitions.

Also, remove the duplicate magic number definitions for
XFS_TRANS_MAGIC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 3ebe7d2d 27-Jun-2013 Dave Chinner <david@fromorbit.com>

xfs: Inode create log items

Introduce the inode create log item type for logical inode create logging.
Instead of logging the changes in buffers, pass the range to be
initialised through the log by a new transaction type. This reduces
the amount of log space required to record initialisation during
allocation from about 128 bytes per inode to a small fixed amount
per inode extent to be initialised.

This requires a new log item type to track it through the log
and the AIL. This is a relatively simple item - most callbacks are
noops as this item has the same life cycle as the transaction.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 5f6bed76 27-Jun-2013 Dave Chinner <david@fromorbit.com>

xfs: Introduce an ordered buffer item

If we have a buffer that we have modified but we do not wish to
physically log in a transaction (e.g. we've logged a logical
change), we still need to ensure that transactional integrity is
maintained. Hence we must not move the tail of the log past the
transaction that the buffer is associated with before the buffer is
written to disk.

This means these special buffers still need to be included in the
transaction and added to the AIL just like a normal buffer, but we
do not want the modifications to the buffer written into the
transaction. IOWs, what we want is an "ordered buffer" that
maintains the same transactional life cycle as a physically logged
buffer, just without the transcribing of the modifications to the
log.

Hence we need to flag the buffer as an "ordered buffer" to avoid
including it in vector size calculations or formatting during the
transaction. Once the transaction is committed, the buffer appears
for all intents to be the same as a physically logged buffer as it
transitions through the log and AIL.

Relogging will also work just fine for such an ordered buffer - the
logical transaction will be replayed before the subsequent
modifications that relog the buffer, so everything will be
reconstructed correctly by recovery.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 2fb8b502 03-May-2013 Jie Liu <jeff.liu@oracle.com>

xfs: Remove two dead transaction log reservaion macros

Upstream commit 5b292ae3a951a58e32119d73c7ac8f5bec7395a3
xfs: make use of xfs_calc_buf_res() in xfs_trans.c

Beginning from above commit, neither XFS_ALLOCFREE_LOG_RES() nor
XFS_DIROP_LOG_RES() is used by those routines for calculating
transaction space reservations, so it's safe to remove them now.

Also, with a slightly update for the relevant comments to reflect
the ideas of why those log count numbers should be.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 742ae1e3 30-Apr-2013 Dave Chinner <dchinner@redhat.com>

xfs: introduce CONFIG_XFS_WARN

Running a CONFIG_XFS_DEBUG kernel in production environments is not
the best idea as it introduces significant overhead, can change
the behaviour of algorithms (such as allocation) to improve test
coverage, and (most importantly) panic the machine on non-fatal
errors.

There are many cases where all we want to do is run a
kernel with more bounds checking enabled, such as is provided by the
ASSERT() statements throughout the code, but without all the
potential overhead and drawbacks.

This patch converts all the ASSERT statements to evaluate as
WARN_ON(1) statements and hence if they fail dump a warning and a
stack trace to the log. This has minimal overhead and does not
change any algorithms, and will allow us to find strange "out of
bounds" problems more easily on production machines.

There are a few places where assert statements contain debug only
code. These are converted to be debug-or-warn only code so that we
still get all the assert checks in the code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 61fe135c 02-Apr-2013 Dave Chinner <dchinner@redhat.com>

xfs: buffer type overruns blf_flags field

The buffer type passed to log recvoery in the buffer log item
overruns the blf_flags field. I had assumed that flags field was a
32 bit value, and it turns out it is a unisgned short. Therefore
having 19 flags doesn't really work.

Convert the buffer type field to numeric value, and use the top 5
bits of the flags field for it. We currently have 17 types of
buffers, so using 5 bits gives us plenty of room for expansion in
future....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d75afeb3 02-Apr-2013 Dave Chinner <dchinner@redhat.com>

xfs: add buffer types to directory and attribute buffers

Add buffer types to the buffer log items so that log recovery can
validate the buffers and calculate CRCs correctly after the buffers
are recovered.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# ee1a47ab 21-Apr-2013 Christoph Hellwig <hch@lst.de>

xfs: add support for large btree blocks

Add support for larger btree blocks that contains a CRC32C checksum,
a filesystem uuid and block number for detecting filesystem
consistency and out of place writes.

[dchinner@redhat.com] Also include an owner field to allow reverse
mappings to be implemented for improved repairability and a LSN
field to so that log recovery can easily determine the last
modification that made it to disk for each buffer.

[dchinner@redhat.com] Add buffer log format flags to indicate the
type of buffer to recovery so that we don't have to do blind magic
number tests to determine what the buffer is.

[dchinner@redhat.com] Modified to fit into the verifier structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a21cd503 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: refactor space log reservation for XFS_TRANS_ATTR_SET

Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:

1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.

2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.

This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.

In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a7bd794a 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk

Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 762d7ba6 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time

Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a1bd9557 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time

Convert the calculation of quota off transaction log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 48001044 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time

The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# f0f2df94 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time

For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# b0c10b98 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time

The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 1813dd64 13-Nov-2012 Dave Chinner <dchinner@redhat.com>

xfs: convert buffer verifiers to an ops structure.

To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.

This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.

We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.

This patch also fixes a directory block readahead verifier issue
it exposed.

This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# c3f8fc73 12-Nov-2012 Dave Chinner <dchinner@redhat.com>

xfs: make buffer read verication an IO completion function

Add a verifier function callback capability to the buffer read
interfaces. This will be used by the callers to supply a function
that verifies the contents of the buffer when it is read from disk.
This patch does not provide callback functions, but simply modifies
the interfaces to allow them to be called.

The reason for adding this to the read interfaces is that it is very
difficult to tell fom the outside is a buffer was just read from
disk or whether we just pulled it out of cache. Supplying a callbck
allows the buffer cache to use it's internal knowledge of the buffer
to execute it only when the buffer is read from disk.

It is intended that the verifier functions will mark the buffer with
an EFSCORRUPTED error when verification fails. This allows the
reading context to distinguish a verification error from an IO
error, and potentially take further actions on the buffer (e.g.
attempt repair) based on the error reported.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d9457dc0 12-Jun-2012 Jan Kara <jack@suse.cz>

xfs: Convert to new freezing code

Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# de2a4f59 22-Jun-2012 Dave Chinner <dchinner@redhat.com>

xfs: add discontiguous buffer support to transactions

Now that the buffer cache supports discontiguous buffers, add
support to the transaction buffer interface for getting and reading
buffers.

Note that this patch does not convert the buffer item logging to
support discontiguous buffers. That will be done as a separate
commit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 77ba7877 02-Apr-2012 Al Viro <viro@zeniv.linux.org.uk>

xfs: switch to proper __bitwise type for KM_... flags

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 43ff2122 22-Apr-2012 Christoph Hellwig <hch@infradead.org>

xfs: on-stack delayed write buffer lists

Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
and write back the buffers per-process instead of by waking up xfsbufd.

This is now easily doable given that we have very few places left that write
delwri buffers:

- log recovery:
Only done at mount time, and already forcing out the buffers
synchronously using xfs_flush_buftarg

- quotacheck:
Same story.

- dquot reclaim:
Writes out dirty dquots on the LRU under memory pressure. We might
want to look into doing more of this via xfsaild, but it's already
more optimal than the synchronous inode reclaim that writes each
buffer synchronously.

- xfsaild:
This is the main beneficiary of the change. By keeping a local list
of buffers to write we reduce latency of writing out buffers, and
more importably we can remove all the delwri list promotions which
were hitting the buffer cache hard under sustained metadata loads.

The implementation is very straight forward - xfs_buf_delwri_queue now gets
a new list_head pointer that it adds the delwri buffers to, and all callers
need to eventually submit the list using xfs_buf_delwi_submit or
xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are
skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
list. The biggest change to pass down the buffer list was done to the AIL
pushing. Now that we operate on buffers the trylock, push and pushbuf log
item methods are merged into a single push routine, which tries to lock the
item, and if possible add the buffer that needs writeback to the buffer list.
This leads to much simpler code than the previous split but requires the
individual IOP_PUSH instances to unlock and reacquire the AIL around calls
to blocking routines.

Given that xfsailds now also handle writing out buffers, the conditions for
log forcing and the sleep times needed some small changes. The most
important one is that we consider an AIL busy as long we still have buffers
to push, and the other one is that we do increment the pushed LSN for
buffers that are under flushing at this moment, but still count them towards
the stuck items for restart purposes. Without this we could hammer on stuck
items without ever forcing the log and not make progress under heavy random
delete workloads on fast flash storage devices.

[ Dave Chinner:
- rebase on previous patches.
- improved comments for XBF_DELWRI_Q handling
- fix XBF_ASYNC handling in queue submission (test 106 failure)
- rename delwri submit function buffer list parameters for clarity
- xfs_efd_item_push() should return XFS_ITEM_PINNED ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# b3934213 06-Dec-2011 Christoph Hellwig <hch@infradead.org>

xfs: remove the lid_size field in struct log_item_desc

Outside the now removed nodelaylog code this field is only used for
asserts and can be safely removed now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 272e42b2 28-Oct-2011 Christoph Hellwig <hch@infradead.org>

xfs: constify xfs_item_ops

The log item ops aren't nessecarily the biggest exploit vector, but marking
them const is easy enough. Also remove the unused xfs_item_ops_t typedef
while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>


# ddc3415a 19-Sep-2011 Christoph Hellwig <hch@infradead.org>

xfs: simplify xfs_trans_ijoin* again

There is no reason to keep a reference to the inode even if we unlock
it during transaction commit because we never drop a reference between
the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
back into xfs_trans_ijoin - the third argument decides if an unlock
is needed now.

I'm actually starting to wonder if allowing inodes to be unlocked
at transaction commit really is worth the effort. The only real
benefit is that they can be unlocked earlier when commiting a
synchronous transactions, but that could be solved by doing the
log force manually after the unlock, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# b1037058 19-Sep-2011 Christoph Hellwig <hch@infradead.org>

xfs: unlock the inode before log force in xfs_fsync

Only read the LSN we need to push to with the ilock held, and then release
it before we do the log force to improve concurrency.

This also removes the only direct caller of _xfs_trans_commit, thus
allowing it to be merged into the plain xfs_trans_commit again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 17b38471 11-Oct-2011 Christoph Hellwig <hch@infradead.org>

xfs: force the log if we encounter pinned buffers in .iop_pushbuf

We need to check for pinned buffers even in .iop_pushbuf given that inode
items flush into the same buffers that may be pinned directly due operations
on the unlinked inode list operating directly on buffers. To do this add a
return value to .iop_pushbuf that tells the AIL push about this and use
the existing log force mechanisms to unpin it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# b2ce3974 11-Jul-2011 Alex Elder <aelder@sgi.com>

Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"

This reverts commit 7a249cf83da1813cfa71cfe1e265b40045eceb47.

That commit created a situation that could lead to a filesystem
hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a
reference to m_active_trans (i.e., keep it non-zero) and then wait
for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze
request could set SB_FREEZE_TRANS and then wait for m_active_trans
to drop to zero. Nobody benefits from this sequence of events...

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 7a249cf8 08-Jul-2011 Christoph Hellwig <hch@lst.de>

xfs: fix filesystsem freeze race in xfs_trans_alloc

As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem
freeze when it sleeps during the memory allocation. Fix this by moving the
wait_for_freeze call after the memory allocation. This means moving the
freeze into the low-level _xfs_trans_alloc helper, which thus grows a new
argument. Also fix up some comments in that area while at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# ec3ba85f 13-Feb-2011 Christoph Hellwig <hch@infradead.org>

xfs: more sensible inode refcounting for ialloc

Currently we return iodes from xfs_ialloc with just a single reference held.
But we need two references, as one is dropped during transaction commit and
the second needs to be transfered to the VFS. Change xfs_ialloc to use
xfs_iget plus xfs_trans_ijoin_ref to grab two references to the inode,
and remove the now superflous IHOLD calls from all callers. This also
greatly simplifies the error handling in xfs_create and also allow to remove
xfs_trans_iget as no other callers are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 821eb21d 01-Dec-2010 Dave Chinner <dchinner@redhat.com>

xfs: connect up buffer reclaim priority hooks

Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# dfe188d4 06-Oct-2010 Christoph Hellwig <hch@infradead.org>

xfs: remove unused t_callback field in struct xfs_trans

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# dcd79a14 27-Sep-2010 Dave Chinner <dchinner@redhat.com>

xfs: don't use vfs writeback for pure metadata modifications

Under heavy multi-way parallel create workloads, the VFS struggles
to write back all the inodes that have been changed in age order.
The bdi flusher thread becomes CPU bound, spending 85% of it's time
in the VFS code, mostly traversing the superblock dirty inode list
to separate dirty inodes old enough to flush.

We already keep an index of all metadata changes in age order - in
the AIL - and continued log pressure will do age ordered writeback
without any extra overhead at all. If there is no pressure on the
log, the xfssyncd will periodically write back metadata in ascending
disk address offset order so will be very efficient.

Hence we can stop marking VFS inodes dirty during transaction commit
or when changing timestamps during transactions. This will keep the
inodes in the superblock dirty list to those containing data or
unlogged metadata changes.

However, the timstamp changes are slightly more complex than this -
there are a couple of places that do unlogged updates of the
timestamps, and the VFS need to be informed of these. Hence add a
new function xfs_trans_ichgtime() for transactional changes,
and leave xfs_ichgtime() for the non-transactional changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# a59f5570 23-Jun-2010 Christoph Hellwig <hch@infradead.org>

xfs: remove the unused XFS_TRANS_NOSLEEP/XFS_TRANS_WAIT flags

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 898621d5 23-Jun-2010 Christoph Hellwig <hch@infradead.org>

xfs: simplify inode to transaction joining

Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
joining it to a transaction via xfs_trans_ijoin.

This patches instead makes xfs_trans_ijoin usable on it's own by doing
an implicity xfs_trans_ihold, which also allows us to drop the third
argument. For the case where we want to hold a reference on the inode
a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
the inode for needing an xfs_iput. In addition to the cleaner interface
to the caller this also simplifies the implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 9412e318 23-Jun-2010 Christoph Hellwig <hch@infradead.org>

xfs: merge iop_unpin_remove into iop_unpin

The unpin_remove item operation instances always share most of the
implementation with the respective unpin implementation. So instead
of keeping two different entry points add a remove flag to the unpin
operation and share the code more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# e98c414f 23-Jun-2010 Christoph Hellwig <hch@infradead.org>

xfs: simplify log item descriptor tracking

Currently we track log item descriptor belonging to a transaction using a
complex opencoded chunk allocator. This code has been there since day one
and seems to work around the lack of an efficient slab allocator.

This patch replaces it with dynamically allocated log item descriptors
from a dedicated slab pool, linked to the transaction by a linked list.

This allows to greatly simplify the log item descriptor tracking to the
point where it's just a couple hundred lines in xfs_trans.c instead of
a separate file. The external API has also been simplified while we're
at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
delete items from a transaction have been simplified to the bare minium,
and the xfs_trans_find_item function is replaced with a direct dereference
of the li_desc field. All debug code walking the list of log items in
a transaction is down to a simple list_for_each_entry.

Note that we could easily use a singly linked list here instead of the
double linked list from list.h as the fastpath only does deletion from
sequential traversal. But given that we don't have one available as
a library function yet I use the list.h functions for simplicity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 025101dc 04-May-2010 Christoph Hellwig <hch@infradead.org>

xfs: cleanup log reservation calculactions

Instead of having small helper functions calling big macros do the
calculations for the log reservations directly in the functions.
These are mostly 1:1 from the macros execept that the macros kept
the quota calculations in their callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# ccf7c23f 20-May-2010 Dave Chinner <dchinner@redhat.com>

xfs: Ensure inode allocation buffers are fully replayed

With delayed logging, we can get inode allocation buffers in the
same transaction inode unlink buffers. We don't currently mark inode
allocation buffers in the log, so inode unlink buffers take
precedence over allocation buffers.

The result is that when they are combined into the same checkpoint,
only the unlinked inode chain fields are replayed, resulting in
uninitialised inode buffers being detected when the next inode
modification is replayed.

To fix this, we need to ensure that we do not set the inode buffer
flag in the buffer log item format flags if the inode allocation has
not already hit the log. To avoid requiring a change to log
recovery, we really need to make this a modification that relies
only on in-memory sate.

We can do this by checking during buffer log formatting (while the
CIL cannot be flushed) if we are still in the same sequence when we
commit the unlink transaction as the inode allocation transaction.
If we are, then we do not add the inode buffer flag to the buffer
log format item flags. This means the entire buffer will be
replayed, not just the unlinked fields. We do this while
CIL flusheѕ are locked out to ensure that we don't race with the
sequence numbers changing and hence fail to put the inode buffer
flag in the buffer format flags when we really need to.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 71e330b5 20-May-2010 Dave Chinner <dchinner@redhat.com>

xfs: Introduce delayed logging core code

The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.

We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.

The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.

To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion. This includes adding a log ticket for tracking space
log required/used by the context checkpoint.

To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out. This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.

Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.

This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.

The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.

Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.

Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.

Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.

We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.

To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# ed3b4d6c 20-May-2010 Dave Chinner <david@fromorbit.com>

xfs: Improve scalability of busy extent tracking

When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents. This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 9abbc539 12-Apr-2010 Dave Chinner <dchinner@redhat.com>

xfs: add log item recovery tracing

Currently there is no tracing in log recovery, so it is difficult to
determine what is going on when something goes wrong.

Add tracing for log item recovery to provide visibility into the log
recovery process. The tracing added shows regions being extracted
from the log transactions and added to the transaction hash forming
recovery items, followed by the reordering, cancelling and finally
recovery of the items.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 8e123850 07-Mar-2010 Dave Chinner <dchinner@redhat.com>

xfs: remove stale parameter from ->iop_unpin method

The staleness of a object being unpinned can be directly derived
from the object itself - there is no need to extract it from the
object then pass it as a parameter into IOP_UNPIN().

This means we can kill the XFS_LID_BUF_STALE flag - it is set,
checked and cleared in the same places XFS_BLI_STALE flag in the
xfs_buf_log_item so it is now redundant and hence safe to remove.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 35a8a72f 15-Feb-2010 Christoph Hellwig <hch@infradead.org>

xfs: stop passing opaque handles to xfs_log.c routines

Currenly we pass opaque xfs_log_ticket_t handles instead of
struct xlog_ticket pointers, and void pointers instead of
struct xlog_in_core pointers to various log manager functions.
Instead pass properly typed pointers after adding forward
declarations for them to xfs_log.h, and adjust the touched
function prototypes to the standard XFS style while at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# d808f617 01-Feb-2010 Dave Chinner <david@fromorbit.com>

xfs: Don't issue buffer IO direct from AIL push V2

All buffers logged into the AIL are marked as delayed write.
When the AIL needs to push the buffer out, it issues an async write of the
buffer. This means that IO patterns are dependent on the order of
buffers in the AIL.

Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd is run the buffer will
be flushed by the xfsbufd. Return the state to the xfsaild that the
buffer was promoted so that the xfsaild knows that it needs to cause
the xfsbufd to run to flush the buffers that were promoted.

Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.

Version 2
- kill XFS_ITEM_FLUSHING as it is now unused.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 0b1b213f 14-Dec-2009 Christoph Hellwig <hch@infradead.org>

xfs: event tracing support

Convert the old xfs tracing support that could only be used with the
out of tree kdb and xfsidbg patches to use the generic event tracer.

To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
all xfs trace channels by:

echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

or alternatively enable single events by just doing the same in one
event subdirectory, e.g.

echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

or set more complex filters, etc. In Documentation/trace/events.txt
all this is desctribed in more detail. To reads the events do a

cat /sys/kernel/debug/tracing/trace

Compared to the last posting this patch converts the tracing mostly to
the one tracepoint per callsite model that other users of the new
tracing facility also employ. This allows a very fine-grained control
of the tracing, a cleaner output of the traces and also enables the
perf tool to use each tracepoint as a virtual performance counter,
allowing us to e.g. count how often certain workloads git various
spots in XFS. Take a look at

http://lwn.net/Articles/346470/

for some examples.

Also the btree tracing isn't included at all yet, as it will require
additional core tracing features not in mainline yet, I plan to
deliver it later.

And the really nice thing about this patch is that it actually removes
many lines of code while adding this nice functionality:

fs/xfs/Makefile | 8
fs/xfs/linux-2.6/xfs_acl.c | 1
fs/xfs/linux-2.6/xfs_aops.c | 52 -
fs/xfs/linux-2.6/xfs_aops.h | 2
fs/xfs/linux-2.6/xfs_buf.c | 117 +--
fs/xfs/linux-2.6/xfs_buf.h | 33
fs/xfs/linux-2.6/xfs_fs_subr.c | 3
fs/xfs/linux-2.6/xfs_ioctl.c | 1
fs/xfs/linux-2.6/xfs_ioctl32.c | 1
fs/xfs/linux-2.6/xfs_iops.c | 1
fs/xfs/linux-2.6/xfs_linux.h | 1
fs/xfs/linux-2.6/xfs_lrw.c | 87 --
fs/xfs/linux-2.6/xfs_lrw.h | 45 -
fs/xfs/linux-2.6/xfs_super.c | 104 ---
fs/xfs/linux-2.6/xfs_super.h | 7
fs/xfs/linux-2.6/xfs_sync.c | 1
fs/xfs/linux-2.6/xfs_trace.c | 75 ++
fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/linux-2.6/xfs_vnode.h | 4
fs/xfs/quota/xfs_dquot.c | 110 ---
fs/xfs/quota/xfs_dquot.h | 21
fs/xfs/quota/xfs_qm.c | 40 -
fs/xfs/quota/xfs_qm_syscalls.c | 4
fs/xfs/support/ktrace.c | 323 ---------
fs/xfs/support/ktrace.h | 85 --
fs/xfs/xfs.h | 16
fs/xfs/xfs_ag.h | 14
fs/xfs/xfs_alloc.c | 230 +-----
fs/xfs/xfs_alloc.h | 27
fs/xfs/xfs_alloc_btree.c | 1
fs/xfs/xfs_attr.c | 107 ---
fs/xfs/xfs_attr.h | 10
fs/xfs/xfs_attr_leaf.c | 14
fs/xfs/xfs_attr_sf.h | 40 -
fs/xfs/xfs_bmap.c | 507 +++------------
fs/xfs/xfs_bmap.h | 49 -
fs/xfs/xfs_bmap_btree.c | 6
fs/xfs/xfs_btree.c | 5
fs/xfs/xfs_btree_trace.h | 17
fs/xfs/xfs_buf_item.c | 87 --
fs/xfs/xfs_buf_item.h | 20
fs/xfs/xfs_da_btree.c | 3
fs/xfs/xfs_da_btree.h | 7
fs/xfs/xfs_dfrag.c | 2
fs/xfs/xfs_dir2.c | 8
fs/xfs/xfs_dir2_block.c | 20
fs/xfs/xfs_dir2_leaf.c | 21
fs/xfs/xfs_dir2_node.c | 27
fs/xfs/xfs_dir2_sf.c | 26
fs/xfs/xfs_dir2_trace.c | 216 ------
fs/xfs/xfs_dir2_trace.h | 72 --
fs/xfs/xfs_filestream.c | 8
fs/xfs/xfs_fsops.c | 2
fs/xfs/xfs_iget.c | 111 ---
fs/xfs/xfs_inode.c | 67 --
fs/xfs/xfs_inode.h | 76 --
fs/xfs/xfs_inode_item.c | 5
fs/xfs/xfs_iomap.c | 85 --
fs/xfs/xfs_iomap.h | 8
fs/xfs/xfs_log.c | 181 +----
fs/xfs/xfs_log_priv.h | 20
fs/xfs/xfs_log_recover.c | 1
fs/xfs/xfs_mount.c | 2
fs/xfs/xfs_quota.h | 8
fs/xfs/xfs_rename.c | 1
fs/xfs/xfs_rtalloc.c | 1
fs/xfs/xfs_rw.c | 3
fs/xfs/xfs_trans.h | 47 +
fs/xfs/xfs_trans_buf.c | 62 -
fs/xfs/xfs_vnodeops.c | 8
70 files changed, 2151 insertions(+), 2592 deletions(-)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 80641dc6 18-Oct-2009 Christoph Hellwig <hch@infradead.org>

xfs: I/O completion handlers must use NOFS allocations

When completing I/O requests we must not allow the memory allocator to
recurse into the filesystem, as we might deadlock on waiting for the
I/O completion otherwise. The only thing currently allocating normal
GFP_KERNEL memory is the allocation of the transaction structure for
the unwritten extent conversion. Add a memflags argument to
_xfs_trans_alloc to allow controlling the allocator behaviour.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Thomas Neumann <tneumann@users.sourceforge.net>
Tested-by: Thomas Neumann <tneumann@users.sourceforge.net>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 13e6d5cd 31-Aug-2009 Christoph Hellwig <hch@lst.de>

xfs: merge fsync and O_SYNC handling

The guarantees for O_SYNC are exactly the same as the ones we need to
make for an fsync call (and given that Linux O_SYNC is O_DSYNC the
equivalent is fdadatasync, but we treat both the same in XFS), except
with a range data writeout. Jan Kara has started unifying these two
path for filesystems using the generic helpers, and I've started to
look at XFS.

The actual transaction commited by xfs_fsync and xfs_write_sync_logforce
has a different transaction number, but actually is exactly the same.
We'll only use the fsync transaction going forward. One major difference
is that xfs_write_sync_logforce never issues a cache flush unless we
commit a transaction causing that as a side-effect, which is an obvious
bug in the O_SYNC handling. Second all the locking and i_update_size
vs i_update_core changes from 978b7237123d007b9fa983af6e0e2fa8f97f9934
never made it to xfs_write_sync_logforce, so we add them back.

To make xfs_fsync easily usable from the O_SYNC path, the filemap_fdatawait
call is moved up to xfs_file_fsync, so that we don't wait on the whole
file after we already waited for our portion in xfs_write.

We'll also use a plain call to filemap_write_and_wait_range instead
of the previous sync_page_rang which did it in two steps including
an half-hearted inode write out that doesn't help us.

Once we're done with this also remove the now useless i_update_size
tracking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>


# 9da096fd 29-Mar-2009 Malcolm Parsons <malcolm.parsons@gmail.com>

xfs: fix various typos

Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 0d87e656 09-Feb-2009 Christoph Hellwig <hch@lst.de>

xfs: remove superflous inobt macros

xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code
but don't provide any abstraction benefits. This patches removes those
and cleans up the reamaining defintions up a little.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# 783a2f65 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] Finish removing the mount pointer from the AIL API

Change all the remaining AIL API functions that are passed struct
xfs_mount pointers to pass pointers directly to the struct xfs_ail being
used. With this conversion, all external access to the AIL is via the
struct xfs_ail. Hence the operation and referencing of the AIL is almost
entirely independent of the xfs_mount that is using it - it is now much
more tightly tied to the log and the items it is tracking in the log than
it is tied to the xfs_mount.

SGI-PV: 988143

SGI-Modid: xfs-linux-melb:xfs-kern:32353a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# fc1829f3 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] Add ail pointer into log items

Add an xfs_ail pointer to log items so that the log items can reference
the AIL directly during callbacks without needed a struct xfs_mount.

SGI-PV: 988143

SGI-Modid: xfs-linux-melb:xfs-kern:32352a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 5b00f14f 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] move the AIl traversal over to a consistent interface

With the new cursor interface, it makes sense to make all the traversing
code use the cursor interface and make the old one go away. This means
more of the AIL interfacing is done by passing struct xfs_ail pointers
around the place instead of struct xfs_mount pointers.

We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered()
as it is only checking if the AIL is empty. We can do that with a call to
xfs_trans_ail_tail() instead, where a zero LSN returned indicates and
empty AIL...

SGI-PV: 988143

SGI-Modid: xfs-linux-melb:xfs-kern:32348a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 847fff5c 30-Oct-2008 Barry Naujok <bnaujok@sgi.com>

[XFS] Sync up kernel and user-space headers

SGI-PV: 986558

SGI-Modid: xfs-linux-melb:xfs-kern:32231a

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 39dab9d7 13-Aug-2008 Eric Sandeen <sandeen@sandeen.net>

[XFS] remove shouting-indirection macros from xfs_trans.h

SGI-PV: 981498

SGI-Modid: xfs-linux-melb:xfs-kern:31758a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 322ff6b8 13-Aug-2008 Niv Sardi <xaiki@sgi.com>

[XFS] Move xfs_attr_rolltrans to xfs_trans_roll

Move it from the attr code to the transaction code and make
the attr code call the new function.

We rolltrans is really usefull whenever we want to use rolling
transaction, should be generic, it isn't dependent on any part
of the attr code anyway.

We use this excuse to change all the:

if ((error = xfs_attr_rolltrans()))

calls into:

error = xfs_trans_roll();

if (error)

SGI-PV: 981498

SGI-Modid: xfs-linux-melb:xfs-kern:31729a

Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 535f6b37 27-Mar-2008 Josef 'Jeff' Sipek <jeffpc@josefsipek.net>

[XFS] Replace custom AIL linked-list code with struct list_head

Replace the xfs_ail_entry_t with a struct list_head and clean the
surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail()
by terminating the loop at the head of the list correctly.

SGI-PV: 978682
SGI-Modid: xfs-linux-melb:xfs-kern:30636a

Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# dfa18b11 05-Mar-2008 Niv Sardi <xaiki@sgi.com>

[XFS] kill t_sema member of struct xfs_trans

It's completely unused so we might aswell kill it. Note that there is
another t_sema in struct xlog_ticket, which is used and actually an sv_t
despite the name. That one is left untouched by this patch.

SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30591a

Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 249a8c11 04-Feb-2008 David Chinner <dgc@sgi.com>

[XFS] Move AIL pushing into it's own thread

When many hundreds to thousands of threads all try to do simultaneous
transactions and the log is in a tail-pushing situation (i.e. full), we
can get multiple threads walking the AIL list and contending on the AIL
lock.

The AIL push is, in effect, a simple I/O dispatch algorithm complicated by
the ordering constraints placed on it by the transaction subsystem. It
really does not need multiple threads to push on it - even when only a
single CPU is pushing the AIL, it can push the I/O out far faster that
pretty much any disk subsystem can handle.

So, to avoid contention problems stemming from multiple list walkers, move
the list walk off into another thread and simply provide a "target" to
push to. When a thread requires a push, it sets the target and wakes the
push thread, then goes to sleep waiting for the required amount of space
to become available in the log.

This mechanism should also be a lot fairer under heavy load as the waiters
will queue in arrival order, rather than queuing in "who completed a push
first" order.

Also, by moving the pushing to a separate thread we can do more
effectively overload detection and prevention as we can keep context from
loop iteration to loop iteration. That is, we can push only part of the
list each loop and not have to loop back to the start of the list every
time we run. This should also help by reducing the number of items we try
to lock and/or push items that we cannot move.

Note that this patch is not intended to solve the inefficiencies in the
AIL structure and the associated issues with extremely large list
contents. That needs to be addresses separately; parallel access would
cause problems to any new structure as well, so I'm only aiming to isolate
the structure from unbounded parallelism here.

SGI-PV: 972759
SGI-Modid: xfs-linux-melb:xfs-kern:30371a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# a8272ce0 22-Nov-2007 David Chinner <dgc@sgi.com>

[XFS] Fix up sparse warnings.

These are mostly locking annotations, marking things static, casts where
needed and declaring stuff in header files.

SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30002a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 92821e2b 23-May-2007 David Chinner <dgc@sgi.com>

[XFS] Lazy Superblock Counters

When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.

When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.

The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.

The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.

This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? the answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.

It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.

One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means the when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.

As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....

SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 1c72bf90 07-May-2007 Eric Sandeen <sandeen@sandeen.net>

[XFS] The last argument "lsn" of xfs_trans_commit() is always called with
NULL.

Patch provided by Eric Sandeen.

SGI-PV: 961693
SGI-Modid: xfs-linux-melb:xfs-kern:28199a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 20f4ebf2 10-Feb-2007 David Chinner <dgc@sgi.com>

[XFS] Make growfs work for amounts greater than 2TB

The free block modification code has a 32bit interface, limiting the size
the filesystem can be grown even on 64 bit machines. On 32 bit machines,
there are other 32bit variables in transaction structures and interfaces
that need to be expanded to allow this to work.

SGI-PV: 959978
SGI-Modid: xfs-linux-melb:xfs-kern:27894a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 804195b6 10-Feb-2007 Eric Sandeen <sandeen@sandeen.net>

[XFS] Get rid of old 5.3/6.1 v1 log items. Cleanup patch sent in by Eric
Sandeen.

SGI-PV: 958736
SGI-Modid: xfs-linux-melb:xfs-kern:27596a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 065d312e 27-Sep-2006 Eric Sandeen <sandeen@sandeen.net>

[XFS] Remove unused iop_abort log item operation

SGI-PV: 955302
SGI-Modid: xfs-linux-melb:xfs-kern:26747a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 1998764e 27-Jun-2006 Alexey Dobriyan <adobriyan@gmail.com>

[XFS] Reduce size of xfs_trans_t structure. * remove ->t_forw, ->t_back --
unused * ->t_ag_freeblks_delta, ->t_ag_flist_delta, ->t_ag_btree_delta
are debugging aid -- wrap them in everyone's favourite way. As a
result, cut "xfs_trans" slab object size from 592 to 572 bytes here.

SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26319a

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# f6c2d1fa 19-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Remove version 1 directory code. Never functioned on Linux, just
pure bloat.

SGI-PV: 952969
SGI-Modid: xfs-linux-melb:xfs-kern:26251a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# c41564b5 28-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] We really suck at spulling. Thanks to Chris Pascoe for fixing all
these typos.

SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:25539a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 8758280f 13-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] Cleanup the use of zones/slabs, more consistent and allows flags to
be passed.

SGI-PV: 949073
SGI-Modid: xfs-linux-melb:xfs-kern:25122a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 3ddb8fa9 10-Jan-2006 Nathan Scott <nathans@sgi.com>

[XFS] Sort out cosmetic differences between user and kernel copies of some
sources.

SGI-PV: 907752
SGI-Modid: xfs-linux-melb:xfs-kern:24659a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 7b718769 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Update license/copyright notices to match the prefered SGI
boilerplate.

SGI-PV: 913862
SGI-Modid: xfs-linux:xfs-kern:23903a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# a844f451 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Remove xfs_macros.c, xfs_macros.h, rework headers a whole lot.

SGI-PV: 943122
SGI-Modid: xfs-linux:xfs-kern:23901a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 61c1e689 01-Nov-2005 Christoph Hellwig <hch@sgi.com>

[XFS] remove unused struct xfs_ail_ticket

SGI-PV: 919278
SGI-Modid: xfs-linux:xfs-kern:199498a

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# f538d4da 01-Nov-2005 Christoph Hellwig <hch@sgi.com>

[XFS] write barrier support Issue all log sync operations as ordered
writes. In addition flush the disk cache on fsync if the sync cached
operation didn't sync the log to disk (this requires some additional
bookeping in the transaction and log code). If the device doesn't claim to
support barriers, the filesystem has an extern log volume or the trial
superblock write with barriers enabled failed we disable barriers and
print a warning. We should probably fail the mount completely, but that
could lead to nasty boot failures for the root filesystem. Not enabled by
default yet, needs more destructive testing first.

SGI-PV: 912426
SGI-Modid: xfs-linux:xfs-kern:198723a

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# efa092f3 04-Sep-2005 Tim Shimmin <tes@sgi.com>

[XFS] Fixes a bug in the quota code when allocating a new dquot record
which can cause an extent hole to be filled and a free extent to be
processed. In this case, we make a few mistakes: forget to pass back the
transaction, forget to put a hold on the buffer and forget to add the buf
to the new transaction.

SGI-PV: 940366
SGI-Modid: xfs-linux:xfs-kern:23594a

Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 7e9c6396 02-Sep-2005 Tim Shimmin <tes@sgi.com>

[XFS] 929956 add log debugging and tracing info

SGI-PV: 931456
SGI-Modid: xfs-linux:xfs-kern:23155a

Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 4372d6e1 20-Jun-2005 Christoph Hellwig <hch@sgi.com>

[XFS] Remove dead code. Patch from Adrian Bunk

SGI-PV: 936255
SGI-Modid: xfs-linux:xfs-kern:192759a

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!