History log of /freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c
Revision Date Author Comments
# 358606 04-Mar-2020 mav

MFC r358342: MFZoL: Fix resilver writes in vdev_indirect_io_start

This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
zfsonlinux/zfs@5aa95ba0d3502779695341b5f55fa5ba1d3330ff


# 339440 19-Oct-2018 mav

MFC r339329: Add ZIO_TYPE_FREE support for indirect vdevs.

Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have the data
buffers, so don't need the pointer adjustment.

PR: 228750, 229007


# 339345 13-Oct-2018 mav

MFC r339288: Remove extra thread_exit() call left after r329802.

spa_condense_indirect_thread() is no longer a thread function, but just
a callback for new zthr KPI.


# 339111 03-Oct-2018 mav

MFC r337007: MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>


# 339106 03-Oct-2018 mav

MFC r336951: MFV r336950: 9290 device removal reduces redundancy of mirrors

Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


# 339104 03-Oct-2018 mav

MFC r336947: MFV r336946: 9238 ZFS Spacemap Encoding V2

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>


# 339034 01-Oct-2018 sef

MFC r334844, r336180, r336458

r334844

This originated from ZFS On Linux, as
https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

r336180

Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges. One of the missing changes
included a notification on scrub cancellation.

r336458

Fix a couple of typos in r334844 noticed by Richard Kojedzinszky

Approved by: mav
Sponsored by: iXsystems, Inc


# 332547 16-Apr-2018 mav

MFC r331701: MFV r331695, 331700: 9166 zfs storage pool checkpoint

illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>


# 332538 16-Apr-2018 mav

MFC r329805: MFV r329803:
9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e66b9fc1395ff2401aa2952f7c44ae34

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls
vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that
was already acquired from the first call to vdev_indirect_remap(). The specific scenario,
is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist
is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>


# 332537 16-Apr-2018 mav

MFC r329802: MFV r329799, r329800:
9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b4f491d5e839644e0912cad1c9e7122

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>


# 332525 16-Apr-2018 mav

MFC r329732: MFV r329502: 7614 zfs device evacuation/removal

illumos/illumos-gate@5cabbc6b49070407fb9610cfe73d4c0e0dea3e77

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>