History log of /freebsd-10-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Revision Date Author Comments
# 338905 24-Sep-2018 markj

MFC r338724:
Fix an nvpair leak in vdev_geom_read_config().

PR: 230704


# 330524 05-Mar-2018 asomers

MFC r324940:

Fix the error message when creating a zpool on a too-small device

Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path.
It's redundant with the check in vdev_open, and failing to attach here
results in the wrong error message being printed. However, still check for
it in some other situations:

* When opening by guids, so we don't get bogged down reading from slow
devices like floppy drives.
* In vdev_geom_read_pool_label for the same reason, because we iterate over
all providers.
* If the caller requests that we verify the guid, because then we'll have to
read from the device before vdev_open verifies the size.

PR: 222227
Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Reviewed by: avg, mav
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12531


# 330522 05-Mar-2018 asomers

MFC r326401:

Fix assertion when ZFS fails to open certain devices

"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:

1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.

The solution is to change the initialization order to ensure that the
assertion will be satisfied.

PR: 221066
Reported by: David NewHamlet <wheelcomplex@gmail.com>
Reviewed by: avg
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13278


# 326325 28-Nov-2017 asomers

MFC r322546:

Fix some ZFS debugging messages

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Be more careful about the use of provider names vs vdev names in
ZFS_LOG statements.

Sponsored by: Spectra Logic Corp


# 325913 16-Nov-2017 avg

MFC r325228: vdev_geom_close: close errored consumer even if vdev_reopening is set


# 319268 30-May-2017 asomers

MFC r318189:

vdev_geom may associate multiple vdevs per g_consumer

vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.

Fix this by storing a linked list of vdevs in g_consumer's private field.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

* g_consumer.private now stores a linked list of vdev pointers associated
with the consumer instead of just a single vdev pointer.

* Change vdev_geom_set_physpath's signature to more closely match
vdev_geom_set_rotation_rate

* Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
that we've already accessed the consumer by the time we get here.

* Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
in vdev_geom_open, after we know that the open has succeeded.

PR: 218634
Reviewed by: gibbs
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10391


# 319262 30-May-2017 asomers

MFC r316760:

Fix vdev_geom_attach_by_guids for partitioned disks

When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.

The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot. Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.

Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.

Reviewed by: mav
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10365


# 308592 12-Nov-2016 mav

MFC r308055: Add vdev_reopening support to vdev_geom.

It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size. If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.


# 308590 12-Nov-2016 mav

MFC r308051: Matching GUIDs, handle possible race on vdev detach.

In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.

This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.

Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend. During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.


# 308588 12-Nov-2016 mav

MFC r308049: Improve few debugging log messages.


# 308061 28-Oct-2016 mav

MFC r300881, r302058 (by asomers):
Avoid issuing spa config updates for physical path when not necessary

ZFS's configuration needs to be updated whenever the physical path for a
device changes, but not when a new device is introduced. This is because new
devices necessarily cause config updates, but only if they are actually
accepted into the pool.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When
setting the vdev's physical path, only request a config update if
the physical path has changed. Don't request it when opening a
device for the first time, because the config sync will happen
anyway upstack.

sys/geom/geom_dev.c
Split g_dev_set_physpath and g_dev_set_media out of
g_dev_attrchanged


# 308060 28-Oct-2016 mav

MFC r300059 (by asomers): Speed up vdev_geom_open_by_guids

Speedup is hard to measure because the only time vdev_geom_open_by_guids
gets called on many drives at the same time is during boot. But with
vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations
like "zpool create" speed up by 65%.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

* Read all of a vdev's labels in parallel instead of sequentially.
* In vdev_geom_read_config, don't read the entire label, including
the uberblock. That's a waste of RAM. Just read the vdev config
nvlist. Reduces the IO and RAM involved with tasting from 1MB to
448KB.


# 308059 28-Oct-2016 mav

MFC r298814 (by asomers): Fix a use-after-free when "zpool import" fails

clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach.
In the latter function, it would fail to happen in certain circumstances
where cp->private was unset. Ideally, the latter should never happen, but
it can happen when vdev open fails, or where spares are involved.


# 308058 28-Oct-2016 mav

MFC r298786 (by asomers):
Refactor vdev_geom_attach and friends to reduce code duplication

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Move checks for provider's sectorsize and mediasize into a single
location in vdev_geom_attach. Remove the zfs::vdev::taste class;
it's ok to use the regular vdev class for tasting. Consolidate guid
checks into a single location in vdev_attach_ok. Consolidate some
error handling code from vdev_geom_attach into vdev_geom_detach,
closing a resource leak of geom consumers in the process.


# 308057 28-Oct-2016 mav

MFC r294329 (by asomers): Disallow zvol-backed ZFS pools

Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.

Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.

Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.


# 307296 14-Oct-2016 mav

MFC r305456 (by avg): fix zfs pool creation accidentally broken by r305331

The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.


# 299958 16-May-2016 asomers

MFC r298072

Don't corrupt ZFS label's physpath attribute when booting while a disk is
missing

Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach
prior to verifying the device's GUIDs. vdev_geom_attach calls
vdev_geom_attrchange to set the physpath in the vdev object. The result is
that if the disk could not be found, then the labels for other disks in the
same TLD would overwrite the missing disk's physpath with the physpath of
whichever disk currently has the same devname as the missing one used to
have.


# 299536 12-May-2016 asomers

MFC r297986, r298017 to vdev_geom.c

r297986 | asomers | 2016-04-14 13:20:31 -0600 (Thu, 14 Apr 2016) | 6 lines

Update a debugging message in vdev_geom_open_by_guids for consistency with
similar messages elsewhere in the file.

r298017 | asomers | 2016-04-14 17:14:41 -0600 (Thu, 14 Apr 2016) | 8 lines

Add more debugging statements in vdev_geom.c

Log a debugging message whenever geom functions fail in vdev_geom_attach.
Printing these messages is controlled by vfs.zfs.debug


# 299376 10-May-2016 asomers

MFC 297868

Fix rare double free in vdev_geom_attrchanged

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Don't drop the g_topology_lock before freeing old_physpath. That
opens up a race where one thread can call vdev_geom_attrchanged,
set old_physpath, drop the g_topology_lock, then block trying to
acquire the SCL_STATE lock. Then another thread can come into
vdev_geom_attrchanged, set old_physpath to the same value, and
proceed to free it. When the first thread resumes, it will free
the same location.

It turns out that the SCL_STATE lock isn't needed. It was
originally added by gibbs to protect vd->vdev_physpath while
updating the same. However, the update process subsequently was
switched to an atomic operation (a pointer swap). Now, there is
no need for the SCL_STATE lock, and hence no need to drop the
g_topology_lock.


# 297957 14-Apr-2016 mav

MFC r297672: Alike to r293708 relax pool check in vdev_geom_open_by_path().

This made impossible spare disk open by known path, which kind of worked
only because the same fix was applied to vdev_geom_attach_by_guids() in
r293708.


# 297123 20-Mar-2016 mav

MFC r296615: Make ZFS ignore stripe sizes above SPA_MAXASHIFT (8KB).

If device has stripe size bigger then maximal sector size supported by
ZFS, there is nothing can be done to avoid read-modify-write cycles.
Taking that stripe size into account will only reduce space efficiency
and pointlessly bother user with warnings that can not be fixed.


# 297122 20-Mar-2016 mav

MFC r296613: Make ZFS more picky to GEOM stripe sizes and offsets.

Use of misaligned or non-power-of-2 stripes is not really useful for ZFS,
since increased ashift won't help to avoid read-modify-write cycles, and
only reduce pool space efficiency and compression rates.


# 297108 20-Mar-2016 mav

MFC r296510, r296563, r296567: MFV r296505:
6531 Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@97e81309571898df9fdd94aab1216dfcf23e060b


# 297090 20-Mar-2016 mav

MFC r293677 (by asomers): Record physical path information in ZFS Vdevs

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
If available, record the physical path of a vdev in ZFS meta-data.
Do this both when opening the vdev, and when receiving an attribute
change notification from GEOM.

Make vdev_geom_close() synchronous instead of deferring its work to
a GEOM event handler. There is no benefit to deferring the work and
this prevents a future open call from referencing a consumer that is
scheduled for destruction. The close followed by an immediate open
will occur during a vdev reprobe triggered by any type of I/O error.

Consolidate vdev_geom_close() and vdev_geom_detach() into
vdev_geom_close() and vdev_geom_close_locked(). This also moves the
cross linking operations between vdev and GEOM consumer into a
single place (linking in vdev_geom_attach() and unlinking in
vdev_geom_close_locked()).

Differential Revision: https://reviews.freebsd.org/D4524


# 297078 20-Mar-2016 mav

MFC r274304 (by delphij): MFV r274272 and diff reduction with upstream.

Illumos issue:
5244 zio pipeline callers should explicitly invoke next stage


# 294843 26-Jan-2016 asomers

MFC r292066, r292069, r293708, r294027, and r294358, mostly to vdev_geom.c

r292066 | asomers | 2015-12-10 14:46:21 -0700 (Thu, 10 Dec 2015) | 25 lines

During vdev_geom_open, require that the vdev guids match the device's label
except during split, add, or create operations. This fixes a bug where the
wrong disk could be returned, and higher layers of ZFS would immediately
eject it again.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
o When opening by GUID, require both the pool and vdev GUIDs to
match. While it is highly unlikely for two vdevs to have the same
vdev GUIDs, the ZFS storage pool allocator only guarantees they
are unique within a pool.

o Modify the open behavior to:
- If we are opening a vdev that hasn't previously been opened,
open by path without checking GUIDs.
- Otherwise, open by path and verify GUIDs.
- If that fails, search all geom providers for a device with
matching GUIDs.
- If that fails, return ENOENT.

r292069 | asomers | 2015-12-10 17:04:13 -0700 (Thu, 10 Dec 2015) | 6 lines

Change an important error message from ZFS_LOG to printf

r293708 | asomers | 2016-01-11 15:15:46 -0700 (Mon, 11 Jan 2016) | 16 lines

Fix importing l2arc device by guid

After r292066, vdev_geom verifies both the vdev and pool guids of device
labels during open. However, spare and l2arc devices don't have pool guids,
so opening them by guid will fail (opening by path, when the pathname is
known, still succeeds). This change allows a vdev to be opened by guid if
the label contains no pool_guid, which is the case for inactive spares and
l2arc devices.

r294027 | asomers | 2016-01-14 11:19:05 -0700 (Thu, 14 Jan 2016) | 14 lines

Fix race condition involving ZFS remove events

When a ZFS drive disappears, ZFS sends a resource.fs.zfs.removed event to
userland. A userland program like zfsd(8) can use that event, for example to
activate a hotspare. The current code contains a race condition: vdev_geom
will sent the sysevent _before_ spa.c would update the vdev's status,
causing userland processes to see pool state that does not reflect the
device removal. This change moves the sysevent to spa.c, closing the race.

r294358 | asomers | 2016-01-19 16:16:24 -0700 (Tue, 19 Jan 2016) | 10 lines

Quell harmless CID about unchecked return value in nvlist_get_guids.

The return value doesn't need to be checked, because nvlist_get_guid's
callers check the returned values of the guids.


# 274800 21-Nov-2014 smh

MFC r274619:
Disable TRIM on file backed ZFS vdevs and fix TRIM on init

Sponsored by: Multiplay


# 271238 07-Sep-2014 smh

MFC r256956:
Improve ZFS N-way mirror read performance by using load and locality
information.

MFC r260713:
Fix ZFS mirror code for handling multiple DVA's

Also make the addition of the d_rotation_rate binary compatible. This allows
storage drivers compiled for 10.0 to work by preserving the ABI for disks.

Approved by: re (gjb)
Sponsored by: Multiplay


# 270312 21-Aug-2014 smh

MFC r265152 - Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
MFC r265321 - Fix double fault panic when returning EOPNOTSUPP
MFC r269407 - Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods

Sponsored by: Multiplay


# 266720 26-May-2014 smh

MFC r264885

Eliminate duplicate checks in vdev_geom_io_intr error handling

Sponsored by: Multiplay


# 260517 10-Jan-2014 asomers

MFC 259240
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.

Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.

Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.


# 260385 06-Jan-2014 scottl

MFC Alexander Motin's GEOM direct dispatch work:

r256603:
Introduce new function devstat_end_transaction_bio_bt(), adding new argument
to specify present time. Use this function to move binuptime() out of lock,
substantially reducing lock congestion when slow timecounter is used.

r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.

r256607:
Fix passing uninitialized bio_resid argument to g_trace().

r256610:
Add unmapped I/O support to GEOM RAID.

r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).

r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

r259247:
Fix bug introduced at r256607. We have to recalculate bp_resid here since
sizes of original and completed requests may differ due to end of media.

Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.

Submitted by: mav
Sponsored by: iX Systems


# 260339 05-Jan-2014 mav

MFC r259168:
Don't even try to read vdev labels from devices smaller then SPA_MINDEVSIZE
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.


# 274800 21-Nov-2014 smh

MFC r274619:
Disable TRIM on file backed ZFS vdevs and fix TRIM on init

Sponsored by: Multiplay


# 271238 07-Sep-2014 smh

MFC r256956:
Improve ZFS N-way mirror read performance by using load and locality
information.

MFC r260713:
Fix ZFS mirror code for handling multiple DVA's

Also make the addition of the d_rotation_rate binary compatible. This allows
storage drivers compiled for 10.0 to work by preserving the ABI for disks.

Approved by: re (gjb)
Sponsored by: Multiplay


# 270312 21-Aug-2014 smh

MFC r265152 - Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
MFC r265321 - Fix double fault panic when returning EOPNOTSUPP
MFC r269407 - Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods

Sponsored by: Multiplay


# 266720 26-May-2014 smh

MFC r264885

Eliminate duplicate checks in vdev_geom_io_intr error handling

Sponsored by: Multiplay


# 260517 10-Jan-2014 asomers

MFC 259240
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.

Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.

Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.


# 260385 06-Jan-2014 scottl

MFC Alexander Motin's GEOM direct dispatch work:

r256603:
Introduce new function devstat_end_transaction_bio_bt(), adding new argument
to specify present time. Use this function to move binuptime() out of lock,
substantially reducing lock congestion when slow timecounter is used.

r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.

r256607:
Fix passing uninitialized bio_resid argument to g_trace().

r256610:
Add unmapped I/O support to GEOM RAID.

r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).

r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

r259247:
Fix bug introduced at r256607. We have to recalculate bp_resid here since
sizes of original and completed requests may differ due to end of media.

Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.

Submitted by: mav
Sponsored by: iX Systems


# 260339 05-Jan-2014 mav

MFC r259168:
Don't even try to read vdev labels from devices smaller then SPA_MINDEVSIZE
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.