#
f13e01b8 |
|
16-May-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure fast fsync waits for ordered extents after a write failure If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents. This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue. This can result in a fast fsync to start logging file extent items based on existing extent maps before the ordered extents complete, therefore resulting in a log that has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens after and the log tree is replayed the next time the fs is mounted. This can happen for both direct IO writes and buffered writes. For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error: 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened; 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag; 3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag); 4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode; 5) After that a fast fsync starts and locks the inode; 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency. For the buffered write case, here's one example: 1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range(); 2) Flushing the dellaloc created a new extent map X; 3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion; 4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed; 5) The fast fsync proceeds sees extent map X and logs a file extent item based on extent map X, resulting in a log that points to an unwritten data extent - because the ordered extent completion hasn't run yet, it happens only after the logging. To fix this make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion. Note that this issues of using extent maps that point to unwritten locations can not happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
65bb9fb0 |
|
09-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update comment for btrfs_set_inode_full_sync() about locking Nowadays we have a lock used to synchronize mmap writes with reflink and fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment for btrfs_set_inode_full_sync() to mention that it can also be called while holding that mmap lock. Besides being a valid alternative to the inode's VFS lock, we already have the extent map shrinker using that mmap lock instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e485ac6 |
|
10-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: export find_next_inode() as btrfs_find_first_inode() Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ddefc2a |
|
09-Apr-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: move btrfs_page_mkwrite() from inode.c into file.c btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a8a57f9 |
|
22-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: merge btrfs_del_delalloc_inode() helpers The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode() don't follow the pattern when the "__" helper does a special case and are in fact reversed regarding the naming. We can merge them into one as there's only one place that needs to be open coded. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bdc0f89e |
|
09-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: reduce inode lock critical section when setting and clearing delalloc When setting and clearing a delalloc range, at btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent(), we are adding/removing the inode to/from the root's list of delalloc inodes while under the protection of the inode's lock. This however is not needed, we can add and remove the inode to the root's list without holding the inode's lock because here we are under the protection of the io tree's lock, reducing the size of the critical section delimited by the inode's lock. The inode's lock is used in many other places such as when finishing an ordered extent (when calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(), or decreasing the number of outstanding extents) or when reserving space when doing a buffered or direct IO write (calls to functions from delalloc-space.c). So move the inode add/remove operations to the root's list of delalloc inodes to outside the critical section delimited by the inode's lock. This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag since we can rely on the inode's delalloc bytes counter to determine if the inode is or is not in the list. The following fio based test, that exercises IO to multiple files in the same subvolume, was used to test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio --direct=0 --ioengine=sync --thread --directory=$MNT \ --invalidate=1 --group_reporting=1 \ --new_group --rw=randwrite --size=50m --numjobs=200 \ --bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \ --name=foo --filename_format=FioWorkloads.\$jobnum umount $MNT The test was run on a non-debug kernel (Debian's default kernel config) against a 16G null block device. Result before this patch: WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec Result after this patch: WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5169f12 |
|
08-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: stop passing root argument to __btrfs_del_delalloc_inode() There's no need to pass a root argument to __btrfs_del_delalloc_inode() and btrfs_del_delalloc_inode(), we can just pass the inode since the root is always the root associated to that inode. Some remove the root argument from these functions. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5693a128 |
|
26-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: add forward declarations and headers, part 3 Do a cleanup in the rest of the headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8bab0a30 |
|
07-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the pg_offset parameter from btrfs_get_extent() The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to make sure we can only get inline extent at file offset 0. Any invalid inline extent with non-zero file offset would be rejected by tree-checker in the first place. Thus the @pg_offset parameter is not really necessary, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4618d0a6 |
|
05-Dec-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix mismatching parameter names for btrfs_get_extent() The definition for btrfs_get_extent() is using "u64 end" as the last parameter, but in implementation we go "u64 len", and all call sites follows the implementation. This can be very confusing during development, as most developers including me, would just use the snippet returned by LSP (clangd in my case), which would only check the definition. Unfortunately this mismatch is introduced from the very beginning of btrfs. Fix it to prevent further confusion. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
637e6e0f |
|
30-Nov-2023 |
David Sterba <dsterba@suse.com> |
btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES The file_extent_tree was added in 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") so we have an explicit mapping of the file extents to know where it is safe to update i_size. When the feature NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree is not necessary. To save some space in the inode, allocate the tree only when necessary. This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release config. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed9b50a1 |
|
29-Nov-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: cache that we don't have security.capability set When profiling a workload I noticed we were constantly calling getxattr. These were mostly coming from __remove_privs, which will lookup if security.capability exists to remove it. However instrumenting getxattr showed we get called nearly constantly on an idle machine on a lot of accesses. These are wasteful and not free. Other security LSMs have a way to cache their results, but capability doesn't have this, so it's asking us all the time for the xattr. Fix this by setting a flag in our inode that it doesn't have a security.capability xattr. We set this on new inodes and after a failed lookup of security.capability. If we set this xattr at all we'll clear the flag. I haven't found a test in fsperf that this makes a visible difference on, but I assume fs_mark related tests would show it clearly. This is a perf report output of the smallfiles100k run where it shows 20% of our time spent in __remove_privs because we're looking up the non-existent xattr. --21.86%--btrfs_write_check.constprop.0 --21.62%--__file_remove_privs --21.55%--security_inode_need_killpriv --21.54%--cap_inode_need_killpriv --21.53%--__vfs_getxattr --20.89%--btrfs_getxattr Obviously this is just CPU time in a mostly IO bound test, so the actual effect of removing this callchain is minimal. However in just normal testing of an idle system tracing showed around 100 getxattr calls per minute, and with this patch there are 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6e8f898 |
|
29-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: open code timespec64 in struct btrfs_inode The type of timespec64::tv_nsec is 'unsigned long', while we have only u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode. Add separate members for sec and nsec with the corresponding type width. This creates a 4 byte hole in btrfs_inode which can be utilized in the future. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
68539bd0 |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update comment for struct btrfs_inode::lock Update the comment for the lock named "lock" in struct btrfs_inode because it does not mention that the fields "delalloc_bytes", "defrag_bytes", "csum_bytes", "outstanding_extents" and "disk_i_size" are also protected by that lock. Also add a comment on top of each field protected by this lock to mention that the lock protects them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9850787 |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing last_log_commit Currently, the last_log_commit of a root can be accessed concurrently without any lock protection. Readers can be calling btrfs_inode_in_log() early in a fsync call, which reads a root's last_log_commit, while a writer can change the last_log_commit while a log tree if being synced, at btrfs_sync_log(). Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_log_commit field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
398fb913 |
|
27-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: reorder btrfs_inode to fill gaps Previous commit created a hole in struct btrfs_inode, we can move outstanding_extents there. This reduces size by 8 bytes from 1120 to 1112 on a release config. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54c65371 |
|
27-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode The structure btrfs_ordered_inode_tree is used only in one place, in btrfs_inode. The structure itself has a 4 byte hole which is wasted space. Move the btrfs_ordered_inode_tree members to btrfs_inode with a common prefix 'ordered_tree_' where the hole can be utilized and shrink inode size. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b9d0322 |
|
22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove redundant root argument from btrfs_update_inode() The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0a5d0dc5 |
|
22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove redundant root argument from btrfs_update_inode_fallback() The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e41570d3 |
|
07-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: reduce size and reorder compression members in struct btrfs_inode Currently the compression type values are bounded and fit to an u8, we can pack the btrfs_inode a bit by reordering them to the space created by the location key. This reduces size from 1112 to 1104. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5335f437 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: include trace header in where necessary If we no longer include the tracepoints from ctree.h we fail to compile because we have the dependency in some of the header files and source files. Add the include where we have these dependencies to allow us to remove the include from ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
04cc63d1 |
|
25-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add fscrypt related dependencies to respective headers These headers have struct fscrypt_str as function arguments, so add struct fscrypt_str to the theader, and include linux/fscrypt.h in btrfs_inode.h as it also needs the definition of struct fscrypt_name for the new inode args. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c56cbe90 |
|
28-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: reduce the number of arguments to btrfs_run_delalloc_range Instead of a separate page_started argument that tells the callers that btrfs_run_delalloc_range already started writeback by itself, overload the return value with a positive 1 in additio to 0 and a negative error code to indicate that is has already started writeback, and remove the nr_written argument as that caller can calculate it directly based on the range, and in fact already does so for the case where writeback wasn't started yet. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6648cedd |
|
28-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove btrfs_writepage_endio_finish_ordered btrfs_writepage_endio_finish_ordered is a small wrapper around btrfs_mark_ordered_io_finished that just changs the argument passing slightly, and adds a tracepoint. Move the tracpoint to btrfs_mark_ordered_io_finished, which means it now also covers the error handling in btrfs_cleanup_ordered_extent and switch all callers to just call btrfs_mark_ordered_io_finished directly. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
64425500 |
|
18-Jun-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: tracepoints: also show actual number of the outstanding extents The btrfs_inode_mod_outstanding_extents trace event only shows the modified number to the number of outstanding extents. It would be helpful if we can see the resulting extent number as well. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
71df088c |
|
24-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: defer splitting of ordered extents until I/O completion The btrfs zoned completion code currently needs an ordered_extent and extent_map per bio so that it can account for the non-predictable write location from Zone Append. To archive that it currently splits the ordered_extent and extent_map at I/O submission time, and then records the actual physical address in the ->physical field of the ordered_extent. This patch instead switches to record the "original" physical address that the btrfs allocator assigned in spare space in the btrfs_bio, and then rewrites the logical address in the btrfs_ordered_sum structure at I/O completion time. This allows the ordered extent completion handler to simply walk the list of ordered csums and split the ordered extent as needed. This removes an extra ordered extent and extent_map lookup and manipulation during the I/O submission path, and instead batches it in the I/O completion path where we need to touch these anyway. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f541833c |
|
29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move split_flags/combine_flags helpers to inode-item.h These are more related to the inode item flags on disk than the in-memory btrfs_inode, move the helpers to inode-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e917ff56 |
|
03-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: determine synchronous writers from bio or writeback control The writeback_control structure already passes down the information about a writeback being synchronous from the core VM code, and thus information is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags helper. Use that information to decide if checksums calculation is offloaded to a workqueue instead of btrfs_inode::sync_writers field that not only bloats the inode but also has too wide scope, being inode wide instead of limited to the actual writeback request. The sync writes were set in: - btrfs_do_write_iter - regular IO, sync status is set - start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL mode - btrfs_write_marked_extents - write marked extents, writeback with WB_SYNC_ALL mode Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fa4b8cb1 |
|
05-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid iterating over all indexes when logging directory When logging a directory, after copying all directory index items from the subvolume tree to the log tree, we iterate over the subvolume tree to find all dir index items that are located in leaves COWed (or created) in the current transaction. If we keep logging a directory several times during the same transaction, we end up iterating over the same dir index items everytime we log the directory, wasting time and adding extra lock contention on the subvolume tree. So just keep track of the last logged dir index offset in order to start the search for that index (+1) the next time the directory is logged, as dir index values (key offsets) come from a monotonically increasing counter. The following test measures the difference before and after this change: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT # Time values in milliseconds. declare -a fsync_times # Total number of files added to the test directory. num_files=1000000 # Fsync directory after every N files are added. fsync_period=100 mkdir $MNT/testdir fsync_total_time=0 for ((i = 1; i <= $num_files; i++)); do echo -n > $MNT/testdir/file_$i if [ $((i % fsync_period)) -eq 0 ]; then start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) fsync_total_time=$((fsync_total_time + (end - start))) fsync_times[i]=$(( (end - start) / 1000000 )) echo -n -e "Progress $i / $num_files\r" fi done echo -e "\nHistogram of directory fsync duration in ms:\n" printf '%s\n' "${fsync_times[@]}" | \ perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);' fsync_total_time=$((fsync_total_time / 1000000)) echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n" echo umount $MNT The test was run on a non-debug kernel (Debian's default kernel config) against a 15G null block device. Result before this change: Histogram of directory fsync duration in ms: Count: 10000 Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751 Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000 3.000 - 5.278: 1423 ################################# 5.278 - 8.854: 1173 ########################### 8.854 - 14.467: 591 ############## 14.467 - 23.277: 1025 ####################### 23.277 - 37.105: 1422 ################################# 37.105 - 58.809: 2036 ############################################### 58.809 - 92.876: 2316 ##################################################### 92.876 - 146.346: 6 | 146.346 - 230.271: 6 | 230.271 - 362.000: 2 | Total time spent in fsync: 350527 ms Result after this change: Histogram of directory fsync duration in ms: Count: 10000 Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576 Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000 3.000 - 6.007: 3222 ################################# 6.007 - 11.276: 5197 ##################################################### 11.276 - 20.506: 1551 ################ 20.506 - 36.674: 24 | 36.674 - 201.552: 1 | 201.552 - 353.841: 4 | 353.841 - 1088.000: 1 | Total time spent in fsync: 92114 ms Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7edd339c |
|
27-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: pass an ordered_extent to btrfs_extract_ordered_extent To prepare for a new caller that already has the ordered_extent available, change btrfs_extract_ordered_extent to take an argument for it. Add a wrapper for the bio case that still has to do the lookup (for now). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
35a8d7da |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove now spurious bio submission helpers Call btrfs_submit_bio and btrfs_submit_compressed_read directly from submit_one_bio now that all additional functionality has moved into btrfs_submit_bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69ccf3f4 |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: handle recording of zoned writes in the storage layer Move the code that splits the ordered extents and records the physical location for them to the storage layer so that the higher level consumers don't have to care about physical block numbers at all. This will also allow to eventually remove accounting for the zone append write sizes in the upper layer with a little bit more block layer work. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
deb6216f |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: open code the submit_bio_start helpers The submit helpers are now trivial and can be called directly. Note that btree_csum_one_bio has to be moved up in the file a bit to avoid a forward declaration. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0571b635 |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove the io_failure_record infrastructure struct io_failure_record and the io_failure_tree tree are unused now, so remove them. This in turn makes struct btrfs_inode smaller by 16 bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d49d0d3 |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove now unused checksumming helpers Remove the unused btrfs_verify_data_csum helper, and fold btrfs_check_data_csum into its only caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5219044 |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: add a btrfs_data_csum_ok helper Add a new checksumming helper that wraps btrfs_check_data_csum and does all the checks to if we're dealing with some form of nodatacsum I/O. This helper will be used by the new storage layer checksum validation and repair code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2d40141 |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port inode_init_owner() to mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
e55cf7ca |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_add_delayed_iput The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd54766e |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_clear_delalloc_extent The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62798a49 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_split_delalloc_extent The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c5d166f |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_set_delalloc_extent The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2454151c |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_merge_delalloc_extent The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c4f91e2 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_delete_subvolume The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
621af94a |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_check_data_csum The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5d4d75b |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_inode_unlock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
29b6352b |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_inode_lock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d781c1c3 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_submit_dio_repair_bio The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b7620416 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_submit_data_read_bio The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
535a7e5d |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_submit_data_write_bio The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bfa17066 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_submit_bio_start_direct_io The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
882681ac |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_submit_bio_start The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad65ecf3 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters After previous patches the unused parameters can be removed from btree_submit_bio_start and btrfs_submit_bio_start as they don't need to conform to the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab2072b2 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio There's a callback function parameter for btrfs_wq_submit_bio that can be one of: metadata, buffered data, direct io data. The callback abstraction is unnecessary as we have all functions available. Replace the parameter with a command that leads to a direct call in run_one_async_start. The called functions can be then simplified and we can also remove the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7920b773 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio Compression and direct io don't work together so the compression parameter can be dropped after previous patch that changed the call to direct. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19af6a7d |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: change how repair action is passed to btrfs_repair_one_sector There's a function pointer passed to btrfs_repair_one_sector that will submit the right bio for repair. However there are only two callbacks, for buffered and for direct IO. This can be simplified to a bool-based switch and call either function, indirect calls in this case is an unnecessary abstraction. This allows to remove the submit_bio_hook_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2885fd63 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move inode prototypes to btrfs_inode.h I initially wanted to make a new header file for this, but these prototypes do naturally fit into btrfs_inode.h. If we want to extract vfs from pure btrfs code in the future we may need to split this up, but btrfs_inode embeds the vfs_inode, so it makes sense to put the prototypes in this header for now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f60acad3 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_print_data_csum_error into inode.c This isn't used outside of inode.c, there's no reason to define it in btrfs_inode.h. Drop the inline and add __cold as it's for errors that are not in any hot path. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b9b8854 |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use a runtime flag to indicate an inode is a free space inode We always check the root of an inode as well as it's inode number to determine if it's a free space inode. This is problematic as the helper is in a header file where it doesn't have the fs_info definition. To avoid this and make the check a little cleaner simply add a flag to the runtime_flags to indicate that the inode is a free space inode, set that when we create the inode, and then change the helper to check for this flag. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e256927b |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: open code and remove btrfs_insert_inode_hash helper This exists to insert the btree_inode in the super blocks inode hash table. Since it's only used for the btree inode move the code to where we use it in disk-io.c and remove the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee8ba05c |
|
14-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: open code and remove btrfs_inode_sectorsize helper This is defined in btrfs_inode.h, and dereferences btrfs_root and btrfs_fs_info, both of which aren't defined in btrfs_inode.h. Additionally, in many places we already have root or fs_info, so this helper often makes the code harder to read. So delete the helper and simply open code it in the few places that we use it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
87c11705 |
|
09-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: convert the io_failure_tree to a plain rb_tree We still have this oddity of stashing the io_failure_record in the extent state for the io_failure_tree, which is leftover from when we used to stuff private pointers in extent_io_trees. However this doesn't make a lot of sense for the io failure records, we can simply use a normal rb_tree for this. This will allow us to further simplify the extent_io_tree code by removing the io_failure_rec pointer from the extent state. Convert the io_failure_tree to an rb tree + spinlock in the inode, and then use our rb tree simple helpers to insert and find failed records. This greatly cleans up this code and makes it easier to separate out the extent_io_tree code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf2404a9 |
|
11-Jul-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add optimized btrfs_ino() version for 64 bits systems Currently btrfs_ino() tries to use first the objectid of the inode's location key. This is to avoid truncation of the inode number on 32 bits platforms because the i_ino field of struct inode has the unsigned long type, while the objectid is a 64 bits unsigned type (u64) on every system. This logic was added in commit 33345d01522f81 ("Btrfs: Always use 64bit inode number"). However if we are running on a 64 bits system, we can always directly return the i_ino value from struct inode, which eliminates the need for he special if statement that tests for a location key type of BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as the objectid in the inode's location objectid, it may have a value of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with subvolumes/snapshots inside them. So add a special version for 64 bits system that directly returns i_ino of struct inode. This eliminates one branch and reduces the overall code size, since btrfs_ino() is an inline function that is extensively used. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1612028 189180 29032 1830240 1bed60 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
adac5584 |
|
11-Jul-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: set the objectid of the btree inode's location key We currently don't use the location key of the btree inode, its content is set to zeroes, as it's a special inode that is not persisted (it has no inode item stored in any btree). At btrfs_ino(), an inline function used extensively in btrfs, we have this special check if the given inode's location objectid is 0, and if it is, we return the value stored in the VFS' inode i_ino field instead (which is BTRFS_BTREE_INODE_OBJECTID for the btree inode). To reduce the code at btrfs_ino(), we can simply set the objectid of the btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the need to check for the special case of the objectid being zero, with the side effect of reducing the overall code size and having less code to execute, as btrfs_ino() is an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0201fceb |
|
06-Jul-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove the inode cache check at btrfs_is_free_space_inode() The inode cache feature was removed in kernel 5.11, and we no longer have any code that reads from or writes to inode caches. We may still mount a filesystem that has inode caches, but they are ignored. Remove the check for an inode cache from btrfs_is_free_space_inode(), since we no longer have code to trigger reads from an inode cache or writes to an inode cache. The check at send.c is still needed, because in case we find a filesystem with an inode cache, we must ignore it. Also leave the checks at tree-checker.c, as they are sanity checks. This eliminates a dead branch and reduces the amount of code since it's in an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a3e171a0 |
|
05-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move struct btrfs_dio_private to inode.c The btrfs_dio_private structure is only used in inode.c, so move the definition there. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
acb8b52a |
|
05-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove the disk_bytenr in struct btrfs_dio_private This field is never used, so remove it. Last use was probably in 23ea8e5a0767 ("Btrfs: load checksum data once when submitting a direct read io"). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e6f9d696 |
|
15-Apr-2022 |
Chung-Chiang Cheng <cccheng@synology.com> |
btrfs: export a helper for compression hard check inode_can_compress will be used outside of inode.c to check the availability of setting compression flag by xattr. This patch moves this function as an internal helper and renames it to btrfs_inode_can_compress. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23e3337f |
|
16-Feb-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: reset last_reflink_trans after fsyncing inode When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit 40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree") and commit e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")). We also need to make sure a full fsync will copy all old file extent items it finds in modified leaves, because they might have been copied from some other inode. However once we fsync an inode, we don't need to keep paying the price of that extra special care in future fsyncs done in the same transaction, unless the inode is used for another reflink operation or the full sync flag is set on it (truncate, failure to allocate extent maps for holes, and other exceptional and infrequent cases). So after we fsync an inode reset its last_unlink_trans to zero. In case another reflink happens, we continue to update the last_reflink_trans of the inode, just as before. Also set last_reflink_trans to the generation of the last transaction that modified the inode whenever we need to set the full sync flag on the inode, just like when we need to load an inode from disk after eviction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
528ee697 |
|
14-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: put initial index value of a directory in a constant At btrfs_set_inode_index_count() we refer twice to the number 2 as the initial index value for a directory (when it's empty), with a proper comment explaining the reason for that value. In the next patch I'll have to use that magic value in the directory logging code, so put the value in a #define at btrfs_inode.h, to avoid hardcoding the magic value again at tree-log.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
339d0354 |
|
25-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: only copy dir index keys when logging a directory Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47926ab5 |
|
08-Oct-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename btrfs_dio_private::logical_offset to file_offset The naming of "logical_offset" can be confused with logical bytenr of the dio range. In fact it's file offset, and the naming "file_offset" is already widely used in all other sites. Just do the rename to avoid confusion. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dc287224 |
|
16-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: keep track of the last logged keys when logging a directory After the first time we log a directory in the current transaction, for each directory item in a changed leaf of the subvolume tree, we have to check if we previously logged the item, in order to overwrite it in case its data changed or skip it in case its data hasn't changed. Checking if we have logged each item before not only wastes times, but it also adds lock contention on the log tree. So in order to minimize the number of times we do such checks, keep track of the offset of the last key we logged for a directory and, on the next time we log the directory, skip the checks for any new keys that have an offset greater than the offset we have previously saved. This is specially effective for index keys, because the offset for these keys comes from a monotonically increasing counter. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 5/5. The following test was used on a non-debug kernel to measure the impact it has on a directory fsync: $ cat test-dir-fsync.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=100000 NUM_FILE_DELETES=1000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT Test results with NUM_NEW_FILES set to 100 000 and 1 000 000: **** before patchset, 100 000 files, 1000 deletes **** dir fsync took 848 ms after adding 100000 files dir fsync took 175 ms after deleting 1000 files **** after patchset, 100 000 files, 1000 deletes **** dir fsync took 758 ms after adding 100000 files (-11.2%) dir fsync took 63 ms after deleting 1000 files (-94.1%) **** before patchset, 1 000 000 files, 1000 deletes **** dir fsync took 9945 ms after adding 1000000 files dir fsync took 473 ms after deleting 1000 files **** after patchset, 1 000 000 files, 1000 deletes **** dir fsync took 8677 ms after adding 1000000 files (-13.6%) dir fsync took 146 ms after deleting 1000 files (-105.6%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14605409 |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77eea05e |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
209ecbb8 |
|
22-Feb-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove stale comment and logic from btrfs_inode_in_log() Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why, as it used to be necessary to make sure if we did something like the following: mmap write range A mmap write range B msync range A (ranged fsync) msync range B (ranged fsync) we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing because btrfs_inode_in_log() would return true. This was added in 125c4cf9f37c98 ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and test case generic/325 from fstests exercises that scenario. However, as of commit 487781796d3022 ("btrfs: make fast fsyncs wait only for writeback"), every ranged fsync is now turned into a full ranged fsync (operates on the range from 0 to LLONG_MAX), so it is now pointless to test of emptiness of the list of modified extents, and the comment is clearly outdated. So just remove the comment and list emptiness check, while also changing the function's return type to be a boolean instead of an integer. In case one day we get support for ranged fsyncs again, it will be easy to notice the check is necessary again, because it will make generic/325 always fail. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc0939fc |
|
22-Feb-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between marking inode needs to be logged and log syncing We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8318ba79 |
|
10-Feb-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a i_mmap_lock to our inode We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
523929f1 |
|
21-Dec-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: make btrfs_dio_private::bytes u32 btrfs_dio_private::bytes is only assigned from bio::bi_iter::bi_size, which is never larger than U32. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d45f221 |
|
02-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extent and low on free metadata space When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock: * the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation; * the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation; * the async reclaim task blocks waiting for the delalloc work to complete; * the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree; * if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks. When this happens, traces like the following show up in dmesg/syslog: [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...) Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path). Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes. This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem. CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2f121ab |
|
13-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: skip unnecessary searches for xattrs when logging an inode Every time we log an inode we lookup in the fs/subvol tree for xattrs and if we have any, log them into the log tree. However it is very common to have inodes without any xattrs, so doing the search wastes times, but more importantly it adds contention on the fs/subvol tree locks, either making the logging code block and wait for tree locks or making the logging code making other concurrent operations block and wait. The most typical use cases where xattrs are used are when capabilities or ACLs are defined for an inode, or when SELinux is enabled. This change makes the logging code detect when an inode does not have xattrs and skip the xattrs search the next time the inode is logged, unless the inode is evicted and loaded again or a xattr is added to the inode. Therefore skipping the search for xattrs on inodes that don't ever have xattrs and are fsynced with some frequency. The following script that calls dbench was used to measure the impact of this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device directly (no intermediary filesystem on the host) and using a non-debug kernel (default configuration on Debian distributions): $ cat test.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f -m single -d single $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 200 40 umount $MNT The results before this change: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 5761605 0.172 312.057 Close 4232452 0.002 10.927 Rename 243937 1.406 277.344 Unlink 1163456 0.631 298.402 Deltree 160 11.581 221.107 Mkdir 80 0.003 0.005 Qpathinfo 5221410 0.065 122.309 Qfileinfo 915432 0.001 3.333 Qfsinfo 957555 0.003 3.992 Sfileinfo 469244 0.023 20.494 Find 2018865 0.448 123.659 WriteX 2874851 0.049 118.529 ReadX 9030579 0.004 21.654 LockX 18754 0.003 4.423 UnlockX 18754 0.002 0.331 Flush 403792 10.944 359.494 Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms The results after this change: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 6442521 0.159 230.693 Close 4732357 0.002 10.972 Rename 272809 1.293 227.398 Unlink 1301059 0.563 218.500 Deltree 160 7.796 54.887 Mkdir 80 0.008 0.478 Qpathinfo 5839452 0.047 124.330 Qfileinfo 1023199 0.001 4.996 Qfsinfo 1070760 0.003 5.709 Sfileinfo 524790 0.033 21.765 Find 2257658 0.314 125.611 WriteX 3211520 0.040 232.135 ReadX 10098969 0.004 25.340 LockX 20974 0.003 1.569 UnlockX 20974 0.002 3.475 Flush 451553 10.287 331.037 Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms +10.8% throughput, -8.2% max latency Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2766ff61 |
|
04-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update the number of bytes used by an inode atomically There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that does not match neither what we had before the operation nor what we get after the operation completes. In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html The cases where this can happen are the following: -> Case 1 If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s). This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks. If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2). Example reproducer: $ cat reproducer-1.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null expected=$(stat -c %b $MNT/foobar) # Create a process to keep calling stat(2) on the file and see if the # reported number of blocks used (disk space used) changes, it should # not because we are not increasing the file size nor punching holes. stat_loop $MNT/foobar $expected & loop_pid=$! for ((i = 0; i < 50000; i++)); do xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null done kill $loop_pid &> /dev/null wait umount $DEV $ ./reproducer-1.sh ERROR: unexpected used blocks (got: 0 expected: 128) ERROR: unexpected used blocks (got: 0 expected: 128) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 2 If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes. This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes. However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice. Example reproducer: $ cat reproducer-2.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foobar write_size=$((64 * 1024)) for ((i = 0; i < 16384; i++)); do offset=$(($i * $write_size)) xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null blocks_used=$(stat -c %b $MNT/foobar) # Fsync the file to trigger writeback and keep calling stat(2) on it # to see if the number of blocks used changes. stat_loop $MNT/foobar $blocks_used & loop_pid=$! xfs_io -c "fsync" $MNT/foobar kill $loop_pid &> /dev/null wait $loop_pid done umount $DEV $ ./reproducer-2.sh ERROR: unexpected used blocks (got: 265472 expected: 265344) ERROR: unexpected used blocks (got: 284032 expected: 283904) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 3 Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation. The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be, does not match the number of used blocks before or after the clone/deduplication/zero operation. Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call. Example reproducer: $ cat reproducer-3.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f -m reflink=1 $DEV > /dev/null mount $DEV $MNT extent_size=$((64 * 1024)) num_extents=16384 file_size=$(($extent_size * $num_extents)) # File foo has many small extents. xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \ > /dev/null # File bar has much less extents and has exactly the same data as foo. xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null expected=$(stat -c %b $MNT/foo) # Now deduplicate bar into foo. While the deduplication is in progres, # the number of used blocks/file size reported by stat should not change xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null & dedupe_pid=$! while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do used=$(stat -c %b $MNT/foo) if [ $used -ne $expected ]; then echo "Unexpected blocks used: $used (expected: $expected)" fi done umount $DEV $ ./reproducer-3.sh Unexpected blocks used: 2076800 (expected: 2097152) Unexpected blocks used: 2097024 (expected: 2097152) Unexpected blocks used: 2079872 (expected: 2097152) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. So fix this by: 1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes; 2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes; 3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock. An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and locking the whole range (0 to (u64)-1) while it it computes the number of blocks used. But that would mean blocking stat(2), which is a very used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
223486c2 |
|
02-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: switch cached fs_info::csum_size from u16 to u32 The fs_info value is 32bit, switch also the local u16 variables. This leads to a better assembly code generated due to movzwl. This simple change will shave some bytes on x86_64 and release config: text data bss dec hex filename 1090000 17980 14912 1122892 11224c pre/btrfs.ko 1089794 17980 14912 1122686 11217e post/btrfs.ko DELTA: -206 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55fc29be |
|
29-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use cached value of fs_info::csum_size everywhere btrfs_get_16 shows up in the system performance profiles (helper to read 16bit values from on-disk structures). This is partially because of the checksum size that's frequently read along with data reads/writes, other u16 uses are from item size or directory entries. Replace all calls to btrfs_super_csum_size by the cached value from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
502756b3 |
|
24-Sep-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: remove btrfs_inode::dio_sem The inode dio_sem can be eliminated because all DIO synchronization is now performed through inode->i_rwsem that provides the same guarantees. This reduces btrfs_inode size by 40 bytes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1fd4033d |
|
01-Oct-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag Commit 8d875f95da43 ("btrfs: disable strict file flushes for renames and truncates") eliminated the notion of ordered operations and instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag indicating that a file's content should be synced to disk in case a file is truncated and any writes happen to it concurrently. In fact this intendend behavior was broken until it was fixed in f6dc45c7a93a ("Btrfs: fix filemap_flush call in btrfs_file_release"). All things considered let's give the flag a more descriptive name. Also slightly reword comments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3c57805 |
|
24-Sep-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Since we now perform direct reads using i_rwsem, we can remove this inode flag used to co-ordinate unlocked reads. The truncate call takes i_rwsem. This means it is correctly synchronized with concurrent direct reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fee248d |
|
31-Aug-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: convert btrfs_inode_sectorsize to take btrfs_inode It's counterintuitive to have a function named btrfs_inode_xxx which takes a generic inode. Also move the function to btrfs_inode.h so that it has access to the definition of struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48778179 |
|
10-Aug-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make fast fsyncs wait only for writeback Currently regardless of a full or a fast fsync we always wait for ordered extents to complete, and then start logging the inode after that. However for fast fsyncs we can just wait for the writeback to complete, we don't need to wait for the ordered extents to complete since we use the list of modified extents maps to figure out which extents we must log and we can get their checksums directly from the ordered extents that are still in flight, otherwise look them up from the checksums tree. Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at fsync time"), for fast fsyncs, we used to start logging without even waiting for the writeback to complete first, we would wait for it to complete after logging, while holding a transaction open, which lead to performance issues when using cgroups and probably for other cases too, as wait for IO while holding a transaction handle should be avoided as much as possible. After that, for fast fsyncs, we started to wait for ordered extents to complete before starting to log, which adds some latency to fsyncs and we even got at least one report about a performance drop which bisected to that particular change: https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/ This change makes fast fsyncs only wait for writeback to finish before starting to log the inode, instead of waiting for both the writeback to finish and for the ordered extents to complete. This brings back part of the logic we had that extracts checksums from in flight ordered extents, which are not yet in the checksums tree, and making sure transaction commits wait for the completion of ordered extents previously logged (by far most of the time they have already completed by the time a transaction commit starts, resulting in no wait at all), to avoid any data loss if an ordered extent completes after the transaction used to log an inode is committed, followed by a power failure. When there are no other tasks accessing the checksums and the subvolume btrees, the ordered extent completion is pretty fast, typically taking 100 to 200 microseconds only in my observations. However when there are other tasks accessing these btrees, ordered extent completion can take a lot more time due to lock contention on nodes and leaves of these btrees. I've seen cases over 2 milliseconds, which starts to be significant. In particular when we do have concurrent fsyncs against different files there is a lot of contention on the checksums btree, since we have many tasks writing the checksums into the btree and other tasks that already started the logging phase are doing lookups for checksums in the btree. This change also turns all ranged fsyncs into full ranged fsyncs, which is something we already did when not using the NO_HOLES features or when doing a full fsync. This is to guarantee we never miss checksums due to writeback having been triggered only for a part of an extent, and we end up logging the full extent but only checksums for the written range, which results in missing checksums after log replay. Allowing ranged fsyncs to operate again only in the original range, when using the NO_HOLES feature and doing a fast fsync is doable but requires some non trivial changes to the writeback path, which can always be worked on later if needed, but I don't think they are a very common use case. Several tests were performed using fio for different numbers of concurrent jobs, each writing and fsyncing its own file, for both sequential and random file writes. The tests were run on bare metal, no virtualization, on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device, with a kernel configuration that is the default of typical distributions (debian in this case), without debug options enabled (kasan, kmemleak, slub debug, debug of page allocations, lock debugging, etc). The following script that calls fio was used: $ cat test-fsync.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/btrfs MOUNT_OPTIONS="-o ssd -o space_cache=v2" MKFS_OPTIONS="-d single -m single" if [ $# -ne 5 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 BLOCK_SIZE=$4 WRITE_MODE=$5 if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'" exit 1 fi cat <<EOF > /tmp/fio-job.ini [writers] rw=$WRITE_MODE fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=$BLOCK_SIZE ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The results were the following: ************************* *** sequential writes *** ************************* ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec After patch: WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec (+9.8%, -8.8% runtime) ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec After patch: WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec (+21.5% throughput, -17.8% runtime) ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec After patch: WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec (+28.7% throughput, -22.3% runtime) ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec After patch: WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec (+35.6% throughput, -25.2% runtime) ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec After patch: WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec (+34.1% throughput, -25.6% runtime) ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec After patch: WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec (+19.1% throughput, -16.4% runtime) ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ==== Before patch: WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec After patch: WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec (+23.1% throughput, -18.7% runtime) ************************ *** random writes *** ************************ ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec After patch: WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec (+0.9% throughput, -1.7% runtime) ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec After patch: WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec (+2.3% throughput, -2.0% runtime) ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec After patch: WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec (+15.6% throughput, -13.3% runtime) ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec After patch: WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec (+11.6% throughput, -10.7% runtime) ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec After patch: WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec (+3.8% throughput, -3.8% runtime) ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec After patch: WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec (+12.7% throughput, -11.2% runtime) ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ==== Before patch: WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec After patch: WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec (+6.3% throughput, -6.0% runtime) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ebac17c |
|
14-Jul-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: reduce contention on log trees when logging checksums The possibility of extents being shared (through clone and deduplication operations) requires special care when logging data checksums, to avoid having a log tree with different checksum items that cover ranges which overlap (which resulted in missing checksums after replaying a log tree). Such problems were fixed in the past by the following commits: commit 40e046acbd2f ("Btrfs: fix missing data checksums after replaying a log tree") commit e289f03ea79b ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents") Test case generic/588 exercises the scenario solved by the first commit (purely sequential and deterministic) while test case generic/457 often triggered the case fixed by the second commit (not deterministic, requires specific timings under concurrency). The problems were addressed by deleting, from the log tree, any existing checksums before logging the new ones. And also by doing the deletion and logging of the cheksums while locking the checksum range in an extent io tree (root->log_csum_range), to deal with the case where we have concurrent fsyncs against files with shared extents. That however causes more contention on the leaves of a log tree where we store checksums (and all the nodes in the paths leading to them), even when we do not have shared extents, or all the shared extents were created by past transactions. It also adds a bit of contention on the spin lock of the log_csums_range extent io tree of the log root. This change adds a 'last_reflink_trans' field to the inode to keep track of the last transaction where a new extent was shared between inodes (through clone and deduplication operations). It is updated for both the source and destination inodes of reflink operations whenever a new extent (created in the current transaction) becomes shared by the inodes. This field is kept in memory only, not persisted in the inode item, similar to other existing fields (last_unlink_trans, logged_trans). When logging checksums for an extent, if the value of 'last_reflink_trans' is smaller then the current transaction's generation/id, we skip locking the extent range and deletion of checksums from the log tree, since we know we do not have new shared extents. This reduces contention on the log tree's leaves where checksums are stored. The following script, which uses fio, was used to measure the impact of this change: $ cat test-fsync.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 3 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 cat <<EOF > /tmp/fio-job.ini [writers] rw=write fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=64k ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "Using config:" echo cat /tmp/fio-job.ini echo mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were performed for different numbers of jobs, file sizes and fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has 12 cores, with cpu governance set to performance mode on all cores), 16GiB of ram (the host has 64GiB) and using a NVMe device directly (without an intermediary filesystem in the host). While running the tests, the host was not used for anything else, to avoid disturbing the tests. The obtained results were the following (the last line of fio's output was pasted). Starting with 16 jobs is where a significant difference is observable in this particular setup and hardware (differences highlighted below). The very small differences for tests with less than 16 jobs are possibly just noise and random. **** 1 job, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec after this change: WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec **** 2 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec after this change: WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec **** 4 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec after this change: WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec **** 8 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec after this change: WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec **** 16 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec after this change: WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec (+40.1% throughput, -28.5% runtime) **** 32 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec after this change: WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec (+29.4% throughput, -22.7% runtime) **** 64 jobs, file size 512M, fsync frequency 1 **** before this change: WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec after this change: WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec (+29.3% throughput, -22.7% runtime) **** 128 jobs, file size 256M, fsync frequency 1 **** before this change: WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec after this change: WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec (+121.2% throughput, -54.6% runtime) **** 256 jobs, file size 256M, fsync frequency 1 **** before this change: WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec after this change: WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec (+74.9% throughput, -42.8% runtime) **** 512 jobs, file size 128M, fsync frequency 1 **** before this change: WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec after this change: WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec (+65.1% throughput, -39.3% runtime) **** 1024 jobs, file size 128M, fsync frequency 1 **** before this change: WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec after this change: WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec (+48.7% throughput, -33.1% runtime) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e0fa5d7 |
|
09-Jun-2020 |
David Sterba <dsterba@suse.com> |
Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK" This reverts commit 5f008163a559d566a0ee1190a0a24f3eec6f1ea7. The patch is a simplification after direct IO port to iomap infrastructure, which gets reverted. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f008163 |
|
04-Dec-2019 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Since we now perform direct reads using i_rwsem, we can remove this inode flag used to co-ordinate unlocked reads. The truncate call takes i_rwsem. This means it is correctly synchronized with concurrent direct reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
769b4f24 |
|
16-Apr-2020 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of one layer of bios in direct I/O In the worst case, there are _4_ layers of bios in the Btrfs direct I/O path: 1. The bio created by the generic direct I/O code (dio_bio). 2. A clone of dio_bio we create in btrfs_submit_direct() to represent the entire direct I/O range (orig_bio). 3. A partial clone of orig_bio limited to the size of a RAID stripe that we create in btrfs_submit_direct_hook(). 4. Clones of each of those split bios for each RAID stripe that we create in btrfs_map_bio(). As of the previous commit, the second layer (orig_bio) is no longer needed for anything: we can split dio_bio instead, and complete dio_bio directly when all of the cloned bios complete. This lets us clean up a bunch of cruft, including dip->subio_endio and dip->errors (we can use dio_bio->bi_status instead). It also enables the next big cleanup of direct I/O read repair. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
85879573 |
|
16-Apr-2020 |
Omar Sandoval <osandov@fb.com> |
btrfs: put direct I/O checksums in btrfs_dio_private instead of bio The next commit will get rid of btrfs_dio_private->orig_bio. The only thing we really need it for is containing all of the checksums, but we can easily put the checksum array in btrfs_dio_private and have the submitted bios reference the array. We can also look the checksums up while we're setting up instead of the current awkward logic that looks them up for orig_bio when the first split bio is submitted. (Interestingly, btrfs_dio_private did contain the checksums before commit 23ea8e5a0767 ("Btrfs: load checksum data once when submitting a direct read io"), but it didn't look them up up front.) Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3b318d1 |
|
16-Apr-2020 |
Omar Sandoval <osandov@fb.com> |
btrfs: convert btrfs_dio_private->pending_bios to refcount_t This is really a reference count now, so convert it to refcount_t and rename it to refs. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2390a6da |
|
16-Apr-2020 |
Omar Sandoval <osandov@fb.com> |
btrfs: remove unused btrfs_dio_private::private We haven't used this since commit 9be3395bcd4a ("Btrfs: use a btrfs bioset instead of abusing bio internals"). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41a2ee75 |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce per-inode file extent tree In order to keep track of where we have file extents on disk, and thus where it is safe to adjust the i_size to, we need to have a tree in place to keep track of the contiguous areas we have file extents for. Add helpers to use this tree, as it's not required for NO_HOLES file systems. We will use this by setting DIRTY for areas we know we have file extent item's set, and clearing it when we remove file extent items for truncation. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16ad3be1 |
|
25-Oct-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove unnecessary delalloc mutex for inodes The inode delalloc mutex was added a long time ago by commit f248679e86fea ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and the reason for its introduction is not very clear from the change log. It claims it solves bogus warnings from lockdep, however it lacks an example report/warning from lockdep, or any explanation. Since we have enough concurrentcy protection from the locks of the space info and block reserve objects, and such lockdep warnings don't seem to exist anymore (at least on a 5.3 kernel I couldn't get them with fstests, ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing the size of the btrfs_inode structure. With some quick fio tests doing direct IO and mmap writes I couldn't observe any significant performance increase either (direct IO writes that don't increase the file's size don't hold the inode's lock for their entire duration and mmap writes don't hold the inode's lock at all), which are the only type of writes that could see any performance gain due to less serialization. Review feedback from Josef: The problem was taking the i_mutex in mmap, which is how I was protecting delalloc reservations originally. The delalloc mutex didn't come with all of the other dependencies. That's what the lockdep messages were about, removing the lock isn't going to make them appear again. We _had_ to lock around this because we used to do tricks to keep from over-reserving, and if we didn't serialize delalloc reservations we'd end up with ugly accounting problems when we tried to clean things up. However with my recentish changes this isn't the case anymore. Every operation is responsible for reserving its space, and then adding it to the inode. Then cleaning up is straightforward and can't be mucked up by other users. So we no longer need the delalloc mutex to safe us from ourselves. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ea41d6b2 |
|
03-Jun-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: remove assumption about csum type form btrfs_print_data_csum_error() btrfs_print_data_csum_error() still assumed checksums to be 32 bit in size. Make it size agnostic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7ebc7e5f |
|
03-Jun-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: format checksums according to type for printing Add a small helper for btrfs_print_data_csum_error() which formats the checksum according to it's type for pretty printing. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> [ shorten macro name ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b8aa330d |
|
17-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: improve performance on fsync of files with multiple hardlinks Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard links in new directories") introduced a path that makes fsync fallback to a full transaction commit in order to avoid losing hard links and new ancestors of the fsynced inode. That path is triggered only when the inode has more than one hard link and either has a new hard link created in the current transaction or the inode was evicted and reloaded in the current transaction. That path ends up getting triggered very often (hundreds of times) during the course of pgbench benchmarks, resulting in performance drops of about 20%. This change restores the performance by not triggering the full transaction commit in those cases, and instead iterate the fs/subvolume tree in search of all possible new ancestors, for all hard links, to log them. Reported-by: Zhao Yuhu <zyuhu@suse.com> Tested-by: James Wang <jnwang@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7d157c3d |
|
26-Mar-2019 |
Phillip Potter <phil@philpotter.co.uk> |
btrfs: use common file type conversion Deduplicate the btrfs file type conversion implementation - file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions decared in fs_types.h and implemented in fs_types.c Common implementation can be found via commit: bbe7449e2599 "fs: common implementation of file type" Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41bd6067 |
|
28-Nov-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix fsync of files with multiple hard links in new directories The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Reported-by: Vijay Chidambaram <vvijay03@gmail.com> Reported-by: Jayashree Mohan <jayashree2912@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbe339cc |
|
27-Nov-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop extra enum initialization where using defaults The first auto-assigned value to enum is 0, we can use that and not initialize all members where the auto-increment does the same. This is used for values that are not part of on-disk format. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cd24c69 |
|
01-Nov-2018 |
Ethan Lien <ethanlien@synology.com> |
btrfs: use tagged writepage to mitigate livelock of snapshot Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
06f2548f |
|
09-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Add function to distinguish between data and btree inode This will be used in future patches that remove the optional extent_io_ops callbacks. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4fd786e6 |
|
05-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Remove 'objectid' member from struct btrfs_root There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3c6be6f |
|
21-Jun-2018 |
Arnd Bergmann <arnd@arndb.de> |
btrfs: use timespec64 for i_otime While the regular inode timestamps all use timespec64 now, the i_otime field is btrfs specific and still needs to be converted to correctly represent times beyond 2038. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7efc3e34 |
|
11-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make them consecutive again. Signed-off-by: Omar Sandoval <osandov@fb.com> [ switch them enums so we don't have to do that again ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a575ceeb |
|
11-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: get rid of unused orphan infrastructure Now that we don't keep long-standing reservations for orphan items, root->orphan_block_rsv isn't used. We can git rid of it, along with: - root->orphan_lock, which was used to protect root->orphan_block_rsv - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations in root->orphan_block_rsv - btrfs_orphan_commit_root(), which was the last user of any of these and does nothing else Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b40b695 |
|
11-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM Now that we don't add orphan items for truncate, there can't be races on adding or deleting an orphan item, so this bit is unnecessary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9888c340 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- headers Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
051c98eb |
|
07-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: open code trivial helper btrfs_page_exists_in_range The called function name is self explanatory. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
965aab1c |
|
06-Mar-2018 |
Matthew Wilcox <willy@infradead.org> |
btrfs: Use filemap_range_has_page() The current implementation of btrfs_page_exists_in_range() gives the wrong answer if the workingset code has stored a shadow entry in the page cache. The filemap_range_has_page() function does not have this problem, and it's shared code, so use it instead. eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1c3fac2 |
|
16-Jan-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove btrfs_inode::delayed_iput_count delayed_iput_count wa supposed to be used to implement, well, delayed iput. The idea is that we keep accumulating the number of iputs we do until eventually the inode is deleted. Turns out we never really switched the delayed_iput_count from 0 to 1, hence all conditional code relying on the value of that member being different than 0 was never executed. This, as it turns out, didn't cause any problem due to the simple fact that the generic inode's i_count member was always used to count the number of iputs. So let's just remove the unused member and all unused code. This patch essentially provides no functional changes. While at it, also add proper documentation for btrfs_add_delayed_iput Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69fe2d75 |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make the delalloc block rsv per inode The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dd48d407 |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add tracepoints for outstanding extents mods This is handy for tracing problems with modifying the outstanding extents counters. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b62f87b |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: rework outstanding_extents Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents. Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1 for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1 In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0 This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eec63c65 |
|
17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: separate defrag and property compression Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called. The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug. Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b52aa8c9 |
|
17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: rename variable holding per-inode compression type This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line). We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a35b637 |
|
28-Jun-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: constify tracepoint arguments Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e4cbee9 |
|
03-Jun-2017 |
Christoph Hellwig <hch@lst.de> |
block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
a7e3b975 |
|
03-Apr-2017 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix reported number of inode blocks Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
0b581701 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
abcefb1e |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0970a22e |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_print_data_csum_error take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70ddc553 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_is_free_space_inode take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ef06d27 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_i_size_write take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f6b643e |
|
08-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Better csum error message for data csum mismatch The original csum error message only outputs inode number, offset, check sum and expected check sum. However no root objectid is outputted, which sometimes makes debugging quite painful under multi-subvolume case (including relocation). Also the checksum output is decimal, which seldom makes sense for users/developers and is hard to read in most time. This patch will add root objectid, which will be %lld for rootid larger than LAST_FREE_OBJECTID, and hex csum output for better readability. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0f8939b8 |
|
17-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_inode_in_log take btrfs_inode Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4a0cc7ca |
|
10-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_ino take a struct btrfs_inode Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afcdd129 |
|
02-Sep-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
01327610 |
|
19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f9a8a51 |
|
12-May-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add semaphore to synchronize direct IO writes with fsync Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths. This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
8089fe62 |
|
19-Nov-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: put delayed item hook into inode Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode. The inode can be put into the delayed_iputs list more than once and we have to keep the count. This means we can't use the list_splice to process a bunch of inodes because we'd lost track of the count if the inode is put into the delayed iputs again while it's processed. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50745b0a |
|
28-Aug-2015 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: Direct I/O: Fix space accounting The following call trace is seen when generic/095 test is executed, WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0() Modules linked in: CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0 Call Trace: [<ffffffff81984058>] dump_stack+0x45/0x57 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0 [<ffffffff8117ce07>] destroy_inode+0x37/0x60 [<ffffffff8117cf39>] evict+0x109/0x170 [<ffffffff8117cfd5>] dispose_list+0x35/0x50 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0 [<ffffffff81165951>] kill_anon_super+0x11/0x20 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70 [<ffffffff811660cf>] deactivate_super+0x5f/0x70 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10 [<ffffffff81069c06>] task_work_run+0x96/0xb0 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50 [<ffffffff8198cbc2>] int_signal+0x12/0x17 This means that the inode had non-zero "outstanding extents" during eviction. This occurs because, during direct I/O a task which successfully used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does not clear the bit after finishing the DIO write. A future DIO write could actually fail and the unused reserve space won't be freed because of the previously set BTRFS_INODE_DIO_READY bit. Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the following issue, |-----------------------------------+-------------------------------------| | Task A | Task B | |-----------------------------------+-------------------------------------| | Start direct i/o write on inode X.| | | reserve space | | | Allocate ordered extent | | | release reserved space | | | Set BTRFS_INODE_DIO_READY bit. | | | | splice() | | | Transfer data from pipe buffer to | | | destination file. | | | - kmap(pipe buffer page) | | | - Start direct i/o write on | | | inode X. | | | - reserve space | | | - dio_refill_pages() | | | - sdio->blocks_available == 0 | | | - Since a kernel address is | | | being passed instead of a | | | user space address, | | | iov_iter_get_pages() returns | | | -EFAULT. | | | - Since BTRFS_INODE_DIO_READY is | | | set, we don't release reserved | | | space. | | | - Clear BTRFS_INODE_DIO_READY bit.| | -EIOCBQUEUED is returned. | | |-----------------------------------+-------------------------------------| Hence this commit introduces "struct btrfs_dio_data" to track the usage of reserved data space. The remaining unused "reserve space" can now be freed reliably. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ddba1bfc |
|
17-Jun-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix warning of bytes_may_use While running generic/019, dmesg got several warnings from btrfs_free_reserved_data_space(). Test generic/019 produces some disk failures so sumbit dio will get errors, in which case, btrfs_direct_IO() goes to the error handling and free bytes_may_use, but the problem is that bytes_may_use has been free'd during get_block(). This adds a runtime flag to show if we've gone through get_block(), if so, don't do the cleanup work. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2f2ff0ee |
|
20-Mar-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix metadata inconsistencies after directory fsync We can get into inconsistency between inodes and directory entries after fsyncing a directory. The issue is that while a directory gets the new dentries persisted in the fsync log and replayed at mount time, the link count of the inode that directory entries point to doesn't get updated, staying with an incorrect link count (smaller then the correct value). This later leads to stale file handle errors when accessing (including attempt to delete) some of the links if all the other ones are removed, which also implies impossibility to delete the parent directories, since the dentries can not be removed. Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2), when fsyncing a directory, new files aren't logged (their metadata and dentries) nor any child directories. So this patch fixes this issue too, since it has the same resolution as the incorrect inode link count issue mentioned before. This is very easy to reproduce, and the following excerpt from my test case for xfstests shows how: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file and directory. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io mkdir $SCRATCH_MNT/mydir # Make sure all metadata and data are durably persisted. sync # Add a hard link to 'foo' inside our test directory and fsync only the # directory. The btrfs fsync implementation had a bug that caused the new # directory entry to be visible after the fsync log replay but, the inode # of our file remained with a link count of 1. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2 # Add a few more links and new files. # This is just to verify nothing breaks or gives incorrect results after the # fsync log is replayed. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3 $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2 # Add some subdirectories and new files and links to them. This is to verify # that after fsyncing our top level directory 'mydir', all the subdirectories # and their files/links are registered in the fsync log and exist after the # fsync log is replayed. mkdir -p $SCRATCH_MNT/mydir/x/y/z ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link touch $SCRATCH_MNT/mydir/x/y/z/qwerty # Now fsync only our top directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir # And fsync now our new file named 'hello', just to verify later that it has # the expected content and that the previous fsync on the directory 'mydir' had # no bad influence on this fsync. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Verify the content of our file 'foo' remains the same as before, 8192 bytes, # all with the value 0xaa. echo "File 'foo' content after log replay:" od -t x1 $SCRATCH_MNT/foo # Remove the first name of our inode. Because of the directory fsync bug, the # inode's link count was 1 instead of 5, so removing the 'foo' name ended up # deleting the inode and the other names became stale directory entries (still # visible to applications). Attempting to remove or access the remaining # dentries pointing to that inode resulted in stale file handle errors and # made it impossible to remove the parent directories since it was impossible # for them to become empty. echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)" rm -f $SCRATCH_MNT/foo # Now verify that all files, links and directories created before fsyncing our # directory exist after the fsync log was replayed. [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing" [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing" [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing" [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing" [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \ echo "Link mydir/x/y/foo_y_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \ echo "Link mydir/x/y/z/foo_z_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \ echo "File mydir/x/y/z/qwerty is missing" # We expect our file here to have a size of 64Kb and all the bytes having the # value 0xff. echo "file 'hello' content after log replay:" od -t x1 $SCRATCH_MNT/hello # Now remove all files/links, under our test directory 'mydir', and verify we # can remove all the directories. rm -f $SCRATCH_MNT/mydir/x/y/z/* rmdir $SCRATCH_MNT/mydir/x/y/z rm -f $SCRATCH_MNT/mydir/x/y/* rmdir $SCRATCH_MNT/mydir/x/y rmdir $SCRATCH_MNT/mydir/x rm -f $SCRATCH_MNT/mydir/* rmdir $SCRATCH_MNT/mydir # An fsck, run by the fstests framework everytime a test finishes, also detected # the inconsistency and printed the following error message: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref # unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref status=0 exit The expected golden output for the test is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 5 file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 Which is the output after this patch and when running the test against ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's output is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 1 Link mydir/foo_2 is missing Link mydir/foo_3 is missing Link mydir/x/y/foo_y_link is missing Link mydir/x/y/z/foo_z_link is missing File mydir/x/y/z/qwerty is missing file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty Fsck, without this fix, also complains about the wrong link count: root 5 inode 257 errors 2001, no inode item, link count wrong unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref So fix this by logging the inodes that the dentries point to when fsyncing a directory. A test case for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9cc97d64 |
|
03-Jul-2012 |
chandan r <chandanrmail@gmail.com> |
Btrfs: Add code to support file creation time This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <chandanrmail@gmail.com> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
656f30db |
|
25-Sep-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: be aware of btree inode write errors to avoid fs corruption While we have a transaction ongoing, the VM might decide at any time to call btree_inode->i_mapping->a_ops->writepages(), which will start writeback of dirty pages belonging to btree nodes/leafs. This call might return an error or the writeback might finish with an error before we attempt to commit the running transaction. If this happens, we might have no way of knowing that such error happened when we are committing the transaction - because the pages might no longer be marked dirty nor tagged for writeback (if a subsequent modification to the extent buffer didn't happen before the transaction commit) which makes filemap_fdata[write|wait]_range unable to find such pages (even if they're marked with SetPageError). So if this happens we must abort the transaction, otherwise we commit a super block with btree roots that point to btree nodes/leafs whose content on disk is invalid - either garbage or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like "parent transid verify failed on 10826481664 wanted 25748 found 29562" when reading btree nodes/leafs from disk). Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's i_mapping would not be enough because we need to distinguish between log tree extents (not fatal) vs non-log tree extents (fatal) and because the next call to filemap_fdatawait_range() will catch and clear such errors in the mapping - and that call might be from a log sync and not from a transaction commit, which means we would not know about the error at transaction commit time. Also, checking for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't done and would not be completely reliable, as the eb might be removed from memory and read back when trying to get it, which clears that flag right before reading the eb's pages from disk, making us not know about the previous write error. Using the new 3 flags for the btree inode also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages() returns success, started writeback for all dirty pages and before filemap_fdatawait_range() is called, the writeback for all dirty pages had already finished with errors - because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it could not know that writeback errors happened (the pages were no longer tagged for writeback). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8b110e39 |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: implement repair function when direct read fails This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c1dc0896 |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: do file data check by sub-bio's self Direct IO splits the original bio to several sub-bios because of the limit of raid stripe, and the filesystem will wait for all sub-bios and then run final end io process. But it was very hard to implement the data repair when dio read failure happens, because at the final end io function, we didn't know which mirror the data was read from. So in order to implement the data repair, we have to move the file data check in the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we access. This patch did this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
23ea8e5a |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: load checksum data once when submitting a direct read io The current code would load checksum data for several times when we split a whole direct read io because of the limit of the raid stripe, it would make us search the csum tree for several times. In fact, it just wasted time, and made the contention of the csum tree root be more serious. This patch improves this problem by loading the data at once. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
47059d93 |
|
03-Jul-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: make defragment work with nodatacow option Btrfs defragment will utilize COW feature, which means this did not work for nodatacow option, this problem was detected by xfstests generic/018 with nodatacow mount option. Fix this problem by forcing cow for a extent with state @EXTETN_DEFRAG setting. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
125c4cf9 |
|
11-Sep-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: set inode's logged_trans/last_log_commit after ranged fsync When a ranged fsync finishes if there are still extent maps in the modified list, still set the inode's logged_trans and last_log_commit. This is important in case an inode is fsync'ed and unlinked in the same transaction, to ensure its inode ref gets deleted from the log and the respective dentries in its parent are deleted too from the log (if the parent directory was fsync'ed in the same transaction). Instead make btrfs_inode_in_log() return false if the list of modified extent maps isn't empty. This is an incremental on top of the v4 version of the patch: "Btrfs: fix fsync data loss after a ranged fsync" which was added to its v5, but didn't make it on time. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8d875f95 |
|
12-Aug-2014 |
Chris Mason <clm@fb.com> |
btrfs: disable strict file flushes for renames and truncates Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com>
|
#
fc4adbff |
|
20-May-2014 |
Alex Gartrell <agartrell@fb.com> |
btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking In these instances, we are trying to determine if a page has been accessed since we began the operation for the sake of retry. This is easily accomplished by doing a gang lookup in the page mapping radix tree, and it saves us the dependency on the flag (so that we might eventually delete it). btrfs_page_exists_in_range borrows heavily from find_get_page, replacing the radix tree look up with a gang lookup of 1, so that we can find the next highest page >= index and see if it falls into our lock range. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Alex Gartrell <agartrell@fb.com>
|
#
4e857c58 |
|
17-Mar-2014 |
Peter Zijlstra <peterz@infradead.org> |
arch: Mass conversion of smp_mb__*() Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
bb14a59b |
|
20-Feb-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use signed integer instead of unsigned long integer for log transid The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
63541927 |
|
07-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: add support for inode properties This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: *) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; *) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
67de1176 |
|
25-Dec-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce the delayed inode ref deletion for the single link inode The inode reference item is close to inode item, so we insert it simultaneously with the inode item insertion when we create a file/directory.. In fact, we also can handle the inode reference deletion by the same way. So we made this patch to introduce the delayed inode reference deletion for the single link inode(At most case, the file doesn't has hard link, so we don't take the hard link into account). This function is based on the delayed inode mechanism. After applying this patch, we can reduce the time of the file/directory deletion by ~10%. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
778ba82b |
|
06-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: improve inode hash function/inode lookup Currently the hash value used for adding an inode to the VFS's inode hash table consists of the plain inode number, which is a 64 bits integer. This results in hash table buckets (hlist_head lists) with too many elements for at least 2 important scenarios: 1) When we have many subvolumes. Each subvolume has its own btree where its files and directories are added to, and each has its own objectid (inode number) namespace. This means that if we have N subvolumes, and all have inode number X associated to a file or directory, the corresponding inodes all map to the same hash table entry, resulting in a bucket (hlist_head list) with N elements; 2) On 32 bits machines. Th VFS hash values are unsigned longs, which are 32 bits wide on 32 bits machines, and the inode (objectid) numbers are 64 bits unsigned integers. We simply cast the inode numbers to hash values, which means that for all inodes with the same 32 bits lower half, the same hash bucket is used for all of them. For example, all inodes with a number (objectid) between 0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in the same hash table bucket. This change ensures the inode's hash value depends both on the objectid (inode number) and its subvolume's (btree root) objectid. For 32 bits machines, this change gives better entropy by making the hash value depend on both the upper and lower 32 bits of the 64 bits hash previously computed. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a5874ce6 |
|
11-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: check roots last log commit when checking if an inode has been logged Liu introduced a local copy of the last log commit for an inode to make sure we actually log an inode even if a log commit has already taken place. In order to make sure we didn't relog the same inode multiple times he set this local copy to the current trans when we log the inode, because usually we log the inode and then sync the log. The exception to this is during rename, we will relog an inode if the name changed and it is already in the log. The problem with this is then we go to sync the inode, and our check to see if the inode has already been logged is tripped and we don't sync the log. To fix this we need to _also_ check against the roots last log commit, because it could be less than what is in our local copy of the log commit. This fixes a bug where we rename a file into a directory and then fsync the directory and then on remount the directory is no longer there. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
facc8a22 |
|
25-Jul-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't cache the csum value into the extent state tree Before applying this patch, we cached the csum value into the extent state tree when reading some data from the disk, this operation increased the lock contention of the state tree. Now, we just store the csum value into the bio structure or other unshared structure, so we can reduce the lock contention. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9c931c5a |
|
14-Apr-2013 |
Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> |
btrfs: fix minor typo in comment In the comment describing the sync_writers field of the btrfs_inode struct, "fsyncing" was misspelled "fsycing." Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2e60a51e |
|
08-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: serialize unlocked dio reads with truncate Currently, we can do unlocked dio reads, but the following race is possible: dio_read_task truncate_task ->btrfs_setattr() ->btrfs_direct_IO ->__blockdev_direct_IO ->btrfs_get_block ->btrfs_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to avoid this problem, we must serialize unlocked dio reads with truncate. There are two approaches: - use extent lock to protect the extent that we truncate - use inode_dio_wait() to make sure the truncating task will wait for the read DIO. If we use the 1st one, we will meet the endless truncation problem due to the nonlocked read DIO after we implement the nonlocked write DIO. It is because we still need invoke inode_dio_wait() avoid the race between write DIO and truncation. By that time, we have to introduce btrfs_inode_{block, resume}_nolock_dio() again. That is we have to implement this patch again, so I choose the 2nd way to fix the problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
df0af1a5 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use the inode own lock to protect its delalloc_bytes We need not use a global lock to protect the delalloc_bytes of the inode, just use its own lock. In this way, we can reduce the lock contention and ->delalloc_lock will just protect delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b812ce28 |
|
16-Nov-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: inline csums if we're fsyncing The tree logging stuff needs the csums to be on the ordered extents in order to log them properly, so mark that we're sync and inline the csum creation so we don't have to wait on the csumming to be done when logging extents that are still in flight. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e9976151 |
|
11-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: only log the inode item if we can get away with it Currently we copy all the file information into the log, inode item, the refs, xattrs etc. Except most of this doesn't change from fsync to fsync, just the inode item changes. So set a flag if an xattr changes or a link is added, and otherwise only log the inode item. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
46d8bc34 |
|
29-Aug-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix a bug in checking whether a inode is already in log This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
5dc562c5 |
|
17-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b4d7c3c9 |
|
09-Jul-2012 |
Li Zefan <lizefan@huawei.com> |
Btrfs: kill free_space pointer from inode structure Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: Li Zefan <lizefan@huawei.com>
|
#
83eea1f1 |
|
10-Jul-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: kill root from btrfs_is_free_space_inode Since root can be fetched via BTRFS_I macro directly, we can save an args for btrfs_is_free_space_inode(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
51a8cf9d |
|
10-Jul-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix btrfs_is_free_space_inode to recognize btree inode For btree inode, its root is also 'tree root', so btree inode can be misunderstood as a free space inode. We should add one more check for btree inode. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
7ddf5a42 |
|
08-Jun-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: call filemap_fdatawrite twice for compression I removed this in an earlier commit and I was wrong. Because compression can return from filemap_fdatawrite() without having actually set any of it's pages as writeback() it can make filemap_fdatawait() do essentially nothing, and then we won't find any ordered extents because they may not have been created yet. So not only does this make fsync() completely useless, but it will also screw up if you truncate on a non-page aligned offset since we zero out the end and then wait on ordered extents and then call drop caches. We can drop the cache before the io completes and then we try to unpin the extent we just wrote we won't find it and everything goes sideways. So fix this by putting it back and put a giant comment there to keep me from trying to remove it in the future. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
22ee6985d |
|
29-May-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: check to see if the inode is in the log before fsyncing We have this check down in the actual logging code, but this is after we start a transaction and all that good stuff. So move the helper inode_in_log() out so we can call it in fsync() and avoid starting a transaction altogether and just exit if we've already fsync()'ed this file recently. You would notice this issue if you fsync()'ed a file over and over again until the transaction committed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
8a35d95f |
|
23-May-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
72ac3c0d |
|
23-May-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: convert the inode bit field to use the actual bit operations Miao pointed this out while I was working on an orphan problem that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever where we have different parts of the bit field protected by either no lock at all or different locks which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on that are all atomic so we can keep having our no/different locking for the different flags and then make force_compress it's own thing so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0c4d2d95 |
|
05-Apr-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: use i_version instead of our own sequence We've been keeping around the inode sequence number in hopes that somebody would use it, but nobody uses it and people actually use i_version which serves the same purpose, so use i_version where we used the incore inode's sequence number and that way the sequence is updated properly across the board, and not just in file write. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
f248679e |
|
12-Jan-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: add a delalloc mutex to inodes for delalloc reservations I was using i_mutex for this, but we're getting bogus lockdep warnings by doing that and theres no real way to get rid of those, so just stop using i_mutex to protect delalloc metadata reservations and use a delalloc mutex instead. This shouldn't be contended often at all, only if you are writing and mmap writing to the file at the same time. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
7fd2ae21 |
|
08-Nov-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix our reservations for updating an inode when completing io People have been reporting ENOSPC crashes in finish_ordered_io. This is because we try to steal from the delalloc block rsv to satisfy a reservation to update the inode. The problem with this is we don't explicitly save space for updating the inode when doing delalloc. This is kind of a problem and we've gotten away with this because way back when we just stole from the delalloc reserve without any questions, and this worked out fine because generally speaking the leaf had been modified either by the mtime update when we did the original write or because we just updated the leaf when we inserted the file extent item, only on rare occasions had the leaf not actually been modified, and that was still ok because we'd just use a block or two out of the over-reservation that is delalloc. Then came the delayed inode stuff. This is amazing, except it wants a full reservation for updating the inode since it may do it at some point down the road after we've written the blocks and we have to recow everything again. This worked out because the delayed inode stuff just stole from the global reserve, that is until recently when I changed that because it caused other problems. So here we are, we're doing everything right and being screwed for it. So take an extra reservation for the inode at delalloc reservation time and carry it through the life of the delalloc reservation. If we need it we can steal it in the delayed inode stuff. If we have already stolen it try and do a normal metadata reservation. If that fails try to steal from the delalloc reservation. If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to solve this and in the meantime we'll steal from the global reserve. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see any problems. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7709cde3 |
|
04-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: calculate checksum space correctly We have not been reserving enough space for checksums. We were just reserving bytes for the checksum items themselves, we were not taking into account having to cow the tree and such. This patch adds a csum_bytes counter to the inode for keeping track of the number of bytes outstanding we have for checksums. Then we calculate how many leaves would be required for the checksums we are given and use that to reserve space. This adds a significant amount of bytes to our reservations, but we will handle this later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
0cbbdf7c |
|
14-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill reserved_bytes in inode reserved_bytes is not used for anything in the inode, remove it. Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
f1bdcc0a |
|
14-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: move stuff around in btrfs_inode to get better packing Moving things around to give us better packing in the btrfs_inode. This reduces the size of our inode by 8 bytes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
14c7cca7 |
|
11-Sep-2011 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix an oops when deleting snapshots We can reproduce this oops via the following steps: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs $ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done $ rm -fr /mnt/btrfs/* $ rm -fr /mnt/btrfs/* then we'll get ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:2264! [...] Call Trace: [<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs] [<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0 [<ffffffff81153cc3>] do_rmdir+0x123/0x140 [<ffffffff81145ac7>] ? fput+0x197/0x260 [<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40 [<ffffffff8147896b>] system_call_fastpath+0x16/0x1b RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs] When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID, while the snapshot's location.objectid remains unchanged. However, btrfs_ino() does not take this into account, and returns a wrong ino, and causes the oops. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2cf8572d |
|
26-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use the commit_root for reading free_space_inode crcs Now that we are using regular file crcs for the free space cache, we can deadlock if we try to read the free_space_inode while we are updating the crc tree. This commit fixes things by using the commit_root to read the crcs. This is safe because we the free space cache file would already be loaded if that block group had been changed in the current transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9e0baf60 |
|
15-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix enospc problems with delalloc So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4cb5300b |
|
24-May-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add mount -o auto_defrag This will detect small random writes into files and queue the up for an auto defrag process. It isn't well suited to database workloads yet, but works for smaller files such as rpm, sqlite or bdb databases. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d82a6f1d |
|
11-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill BTRFS_I(inode)->block_group Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
16cdcec7 |
|
22-Apr-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: implement delayed inode items operation Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
33345d01 |
|
19-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Always use 64bit inode number There's a potential problem in 32bit system when we exhaust 32bit inode numbers and start to allocate big inode numbers, because btrfs uses inode->i_ino in many places. So here we always use BTRFS_I(inode)->location.objectid, which is an u64 variable. There are 2 exceptions that BTRFS_I(inode)->location.objectid != inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2), and inode->i_ino will be used in those cases. Another reason to make this change is I'm going to use a special inode to save free ino cache, and the inode number must be > (u64)-256. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
57a45ced |
|
25-Jan-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: change reserved_extents to an atomic_t We track delayed allocation per inodes via 2 counters, one is outstanding_extents and reserved_extents. Outstanding_extents is already an atomic_t, but reserved_extents is not and is protected by a spinlock. So convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg when releasing delalloc bytes. This makes our inode 72 bytes smaller, and reduces locking overhead (albiet it was minimal to begin with). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
261507a0 |
|
16-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Allow to add new compression algorithm Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
d68fc57b |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Metadata reservation for orphan inodes reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ca1f7ce |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Update metadata reservation for delayed allocation Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1e701a32 |
|
11-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add new defrag-range ioctl. The btrfs defrag ioctl was limited to doing the entire file. This commit adds a new interface that can defrag a specific range inside the file. It can also force compression on the file, allowing you to selectively compress individual files after they were created, even when mount -o compress isn't turned on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c2167754 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Fix disk_i_size update corner case There are some cases file extents are inserted without involving ordered struct. In these cases, we update disk_i_size directly, without checking pending ordered extent and DELALLOC bit. This patch extends btrfs_ordered_update_i_size() to handle these cases. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
257c62e1 |
|
13-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: avoid tree log commit when there are no changes rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
32c00aff |
|
08-Oct-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: release delalloc reservations on extent item insertion This patch fixes an issue with the delalloc metadata space reservation code. The problem is we used to free the reservation as soon as we allocated the delalloc region. The problem with this is if we are not inserting an inline extent, we don't actually insert the extent item until after the ordered extent is written out. This patch does 3 things, 1) It moves the reservation clearing stuff into the ordered code, so when we remove the ordered extent we remove the reservation. 2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear delalloc bits in the cases where we want to clear the metadata reservation when we clear the delalloc extent, in the case that we do an inline extent or we invalidate the page. 3) It adds another waitqueue to the space info so that when we start a fs wide delalloc flush, anybody else who also hits that area will simply wait for the flush to finish and then try to make their allocation. This has been tested thoroughly to make sure we did not regress on performance. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ed74f2d |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: proper -ENOSPC handling At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4df27c4d |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: change how subvolumes are organized btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5affd88a |
|
08-Jun-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
switch btrfs to inode->i_acl Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6cbff00f |
|
17-Apr-2009 |
Christoph Hellwig <hch@lst.de> |
Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION Add support for the standard attributes set via chattr and read via lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were confusing the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d4f98a2 |
|
10-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a3f23d5 |
|
31-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add extra flushing for renames and truncates Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
12fcfd22 |
|
24-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: tree logging unlink/rename fixes The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6a63209f |
|
20-Feb-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add better -ENOSPC handling This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
d2fb3437 |
|
11-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix leaking block group on balance The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c3027eb5 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode sequence number for NFS and reserved space in a few structs This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d352ac68 |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
49eb7e46 |
|
11-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Dir fsync optimizations Drop i_mutex during the commit Don't bother doing the fsync at all unless the dir is marked as dirtied and needing fsync in this transaction. For directories, this means that someone has unlinked a file from the dir without fsyncing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e02119d5 |
|
05-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a write ahead tree log to optimize synchronous operations File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
00e4e6b3 |
|
05-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Get rid of BTRFS_I(inode)->index and use local vars instead rename and link don't always have a lock on the source inode, and our use of a per-inode index variable was racy. This changes things to store the index in a local variable instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ea8c2819 |
|
04-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7b128766 |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Create orphan inode records to prevent lost files after a crash Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
33268eaf |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Add ACL support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
aec7477b |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Implement new dir index format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ee6e6504 |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Add a per-inode lock around btrfs_drop_extents btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dbe674a9 |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Update on disk i_size only after pending ordered extents are done This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e6dcd2dc |
|
16-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: New data=ordered implementation The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1b1e2135 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a per-inode csum mutex to avoid races creating csum items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
81d7ed29 |
|
25-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Throttle file_write when data=ordered is flushing the inode Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7e38326f |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Handle checksumming errors while reading data blocks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9069218d |
|
08-Feb-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix i_blocks accounting Now that delayed allocation accounting works, i_blocks accounting is changed to only modify i_blocks when extents inserted or removed. The fillattr call is changed to include the delayed allocation byte count in the i_blocks result. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d1310b2e |
|
24-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Split the extent_map code into two parts There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_bufers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b98b6767 |
|
08-Jan-2008 |
Yan <yanzheng@21cn.com> |
Btrfs: Add inode flags support This patch adds NODATASUM & NODATACOW inode flags support. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dc17ff8f |
|
08-Jan-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add data=ordered support This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a52d9a80 |
|
27-Aug-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Extent based page cache code. This uses an rbtree of extents and tests instead of buffer heads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
15ee9bc7 |
|
10-Aug-2007 |
Josef Bacik <jwhiter@redhat.com> |
Btrfs: delay commits during fsync to allow more writers Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f1ace244 |
|
13-Jun-2007 |
Aneesh <aneesh.kumar@linux.vnet.ibm.com> |
btrfs: Code cleanup Attaching below is some of the code cleanups that i came across while reading the code. a) alloc_path already calls init_path. b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as the in memory copy ext4_inode as the disk copy Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cbd5570 |
|
12-Jun-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add GPLv2 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31f3c99b |
|
30-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allocator improvements, inode block groups Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1b05da2e |
|
09-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: drop the inode map tree Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d6e4a428 |
|
06-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: start of support for many FS volumes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2c90e5d6 |
|
02-Apr-2007 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: still corruption hunting Signed-off-by: Chris Mason <chris.mason@oracle.com>
|