History log of /freebsd-10-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
Revision Date Author Comments
# 324204 02-Oct-2017 avg

MFC r323918: MFV r323917: 8648 Fix range locking in ZIL commit codepath

This fixes a problem introduced in r320496, MFC of r308782.


# 320496 30-Jun-2017 avg

MFC r308782:
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leaving them on SLOG.

Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency dropped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write(), like zvol_log_write(), can now handle writes crossing
a block boundary, which may also improve efficiency if ZPL is made to do that.
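
The priority change described in this entry can be pictured with a small user-space sketch. It is not the kernel code: the enum, the function name, and the threshold parameter are hypothetical, and the commit text does not say how a "heavy" logger is detected. The sketch only shows the idea of keeping the write on the SLOG while lowering its scheduling priority.

```c
#include <stdint.h>

/* Hypothetical stand-ins for ZIO_PRIORITY_SYNC_WRITE / ZIO_PRIORITY_ASYNC_WRITE. */
enum sketch_prio { SK_PRIO_SYNC_WRITE, SK_PRIO_ASYNC_WRITE };

/*
 * Pick a priority for a log write that stays on the SLOG.
 * bytes_logged:    how much this logger has queued so far (assumed metric).
 * heavy_threshold: assumed cut-off above which the logger counts as heavy.
 */
static enum sketch_prio
sketch_slog_priority(uint64_t bytes_logged, uint64_t heavy_threshold)
{
	/* Light loggers keep sync priority; heavy ones are demoted, not moved. */
	return (bytes_logged <= heavy_threshold ?
	    SK_PRIO_SYNC_WRITE : SK_PRIO_ASYNC_WRITE);
}
```

In this model the destination device never changes, which is what avoids the double write mentioned above.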


# 308596 12-Nov-2016 mav

MFC r308173:
Fix ZIL records ordering when ZVOL opened both with and without FSYNC.

Before this, earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.
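
A minimal user-space sketch of the counting scheme described above, assuming a simplified volume structure. The struct and function names are hypothetical; only the zv_sync_cnt idea comes from the commit text. Treating a per-write sync flag as an additional input is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-volume state. */
struct zvol_sketch {
	unsigned open_count;	/* all current openers */
	unsigned sync_count;	/* openers that passed FSYNC; models zv_sync_cnt */
};

/* Record an open and count FSYNC openers separately. */
static void
sketch_open(struct zvol_sketch *zv, bool fsync)
{
	zv->open_count++;
	if (fsync)
		zv->sync_count++;
}

/* Undo the accounting done by sketch_open() for the same opener. */
static void
sketch_close(struct zvol_sketch *zv, bool fsync)
{
	assert(zv->open_count > 0);
	zv->open_count--;
	if (fsync) {
		assert(zv->sync_count > 0);
		zv->sync_count--;
	}
}

/*
 * Decide whether a log record is marked sync. While any FSYNC opener
 * exists, every record is sync, so records cannot be reordered between
 * openers with different flags.
 */
static bool
sketch_record_is_sync(const struct zvol_sketch *zv, bool write_is_sync)
{
	return (write_is_sync || zv->sync_count > 0);
}
```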


# 308594 12-Nov-2016 mav

MFC r308169:
Pass to zvol_log_truncate() same sync values as to zvol_log_write().

Surplus marking of TX_TRUNCATE records as sync could result in putting them
into ZIL before previous writes if those were async.


# 308448 08-Nov-2016 mav

MFC r307857: Fix panic after ZVOL renamed to name invalid for DEVFS.


# 308057 28-Oct-2016 mav

MFC r294329 (by asomers): Disallow zvol-backed ZFS pools

Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.

Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.

Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.
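
The thread-local guard described in this entry can be sketched in user space as follows. The variable and function names are hypothetical (the commit text does not name the real variable), and C11 _Thread_local stands in for whatever per-thread mechanism the kernel uses.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-thread marker: true while this thread is in the vdev open path. */
static _Thread_local bool in_vdev_open_sketch = false;

/* Models zvol_open: refuse the open when called from inside the vdev open path. */
static int
sketch_zvol_open(void)
{
	if (in_vdev_open_sketch)
		return (EOPNOTSUPP);
	return (0);
}

/* Models vdev_geom_open: set the marker around the attempt to open a provider. */
static int
sketch_vdev_geom_open(void)
{
	int error;

	in_vdev_open_sketch = true;
	error = sketch_zvol_open();	/* stands in for tasting a provider that is a zvol */
	in_vdev_open_sketch = false;
	return (error);
}
```

Because the marker is per thread, an ordinary open of the same zvol from another thread is unaffected, and the open path never needs to ask whether it already holds a lock.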


# 297549 04-Apr-2016 mav

MFC r297421: Plug open count leak on zvol rename.


# 297548 04-Apr-2016 mav

MFC r297420: Switch from using make_dev_p() to make_dev_s() to close races.


# 297547 04-Apr-2016 mav

MFC r297337: Pass through error code from make_dev_p().

ENAMETOOLONG is much more informative in logs than ENXIO.


# 297546 04-Apr-2016 mav

MFC r297232: Unify ignoring EEXIST from zvol_create_minor().

This fixes creation of zvol devices for snapshots during zfs receive,
that previously failed with "ZFS WARNING: Unable to create ZVOL" message.
This solution is not perfect, but IMHO better than it was before.


# 297112 20-Mar-2016 mav

MFC r296519: MFV r296518: 5027 zfs large block support (add copyright)

Author: Matthew Ahrens <matt@mahrens.org>

illumos/illumos-gate@c3d26abc9ee97b4f60233556aadeb57e0bd30bb9


# 290746 13-Nov-2015 mav

MFC r289190: 6250 zvol_dump_init() can hold txg open

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@b10bba72460aeaa53119c76ff5e647fd5585bece


# 288571 03-Oct-2015 mav

MFC r286705: 5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().

Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.


# 288520 02-Oct-2015 mav

MFC r279996 (by smh): Allow zvol_geom_worker to process BIO_DELETE's

If zvol_geom_start is called with a BIO_DELETE from a thread which can
sleep it queues it for later processing by the zvol_geom_worker. The
zvol_geom_worker didn't have a delete case so would simply lose the bio,
hence preventing the original caller from ever completing. In addition,
any other unknown types would suffer the same fate.

Allow zvol_geom_worker to process BIO_DELETE's via zvol_strategy and
return unsupported for all unknown bio types.
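
A sketch of the dispatch shape this fix implies, written as plain user-space C. The enum values, struct, and helpers are hypothetical and do not reproduce the kernel's BIO_* definitions. Handling reads and writes the same way as deletes is an assumption of the sketch. The point it illustrates is that every request taken off the queue is completed, including unknown ones.

```c
#include <errno.h>

/* Hypothetical command codes; not the kernel's BIO_* values. */
enum sketch_cmd { SK_READ, SK_WRITE, SK_DELETE, SK_OTHER };

struct sketch_bio {
	enum sketch_cmd cmd;
	int error;	/* completion status */
	int done;	/* nonzero once the request has been completed */
};

/* Mark a request as finished with the given status. */
static void
sketch_complete(struct sketch_bio *bp, int error)
{
	bp->error = error;
	bp->done = 1;
}

/* Process one queued request; every path completes the bio. */
static void
sketch_worker_handle(struct sketch_bio *bp)
{
	switch (bp->cmd) {
	case SK_READ:
	case SK_WRITE:
	case SK_DELETE:		/* the case that was previously missing */
		sketch_complete(bp, 0);	/* stands in for zvol_strategy() */
		break;
	default:
		/* Unknown types are rejected instead of being dropped. */
		sketch_complete(bp, EOPNOTSUPP);
		break;
	}
}
```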


# 280753 27-Mar-2015 mav

MFC r279927: Make DIOCGATTR in device mode handle "GEOM::candelete".


# 277699 25-Jan-2015 mav

MFC r276913: Use new optimized dmu_read_uio_dbuf() for ZVOLs in device mode.

This slightly reduces overhead by avoiding dnode_hold()/dnode_rele() calls.


# 277483 21-Jan-2015 smh

MFC r276063:
Standardise on illumos for #ifdef's in zvol.c

MFC r276066:
Refactor zvol locking to minimise diff with upstream

MFC r276069:
Fix panic when resizing ZFS zvol's

Sponsored by: Multiplay


# 277482 21-Jan-2015 smh

MFC r272509 (by delphij):
Diff reduction with upstream

Sponsored by: Multiplay


# 276081 22-Dec-2014 delphij

MFC r274337,r274673,r274681,r275515:

ZFS large block support. The default recordsize remains at 128KB.

A new tunable/sysctl variable, vfs.zfs.max_recordsize is added to
allow adjusting the permitted maximum record size, or
zfs_max_recordsize, with a default of 1MB. ZFS will not allow
setting recordsize greater than zfs_max_recordsize as a safety
belt, because larger recordsize means greater read and write
latency and more memory usage.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool).

Limited safety belt is provided for mounted root filesystem but use
caution when using a larger value.

Illumos issue:
5027 zfs large block support


# 275892 18-Dec-2014 mav

MFC r275474: Add GET LBA STATUS command support to CTL.

It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.

Sponsored by: iXsystems, Inc.


# 274732 20-Nov-2014 mav

MFC r274154, r274163:
Add to CTL support for logical block provisioning threshold notifications.

For ZVOL-backed LUNs this allows informing initiators when the storage's used
or available space goes above/below the configured thresholds.

Sponsored by: iXsystems, Inc.


# 273345 20-Oct-2014 delphij

MFC r272510: MFV r272498:

Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system
administrator to toggle whether ZFS should ignore UNMAP requests.

Illumos issue:
5149 zvols need a way to ignore DKIOCFREE


# 272883 09-Oct-2014 smh

MFC r272474:
Fix various issues with zvols

Sponsored by: Multiplay


# 272615 06-Oct-2014 mav

MFC r271308:
Make ZVOL writes in device mode support IO_SYNC flag.


# 269006 22-Jul-2014 delphij

MFC r268473: MFV r268455:

Use reserved space for ZFS administrative commands.


# 269002 22-Jul-2014 delphij

MFC r268464: MFV r268452:

Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue: 4950 files sometimes can't be removed from a full
filesystem


# 268657 15-Jul-2014 delphij

MFC r268123: MFV r268119:

4914 zfs on-disk bookmark structure should be named *_phys_t


# 268649 15-Jul-2014 delphij

MFC r268075: MFV r267565:

4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks


# 268274 04-Jul-2014 mav

MFC r268178:
Fix bug in sync control in new "dev" mode of ZVOL (r265678).

Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".


# 265678 08-May-2014 mav

MFC r264145:
Add property and sysctl to control how ZVOLs are exposed to OS.

New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.


# 265677 08-May-2014 mav

MFC r264086:
3580 Want zvols to return volblocksize when queried for physical block size

illumos/illumos-gate@a0b60564dfc644f4bfaef1ce26d343b44cf68bc5

It is irrelevant for FreeBSD, just reducing diff.


# 264733 21-Apr-2014 mav

MFC r264193:
In addition to r264077, tell GEOM that we do support BIO_DELETE now.


# 264732 21-Apr-2014 mav

MFC r264077:
Add BIO_DELETE support to ZVOL.

It is an adapted merge from the vendor branch of:
701 UNMAP support for COMSTAR (in part related to ZFS)
2130 zvol DKIOCFREE uses nested DMU transactions


# 263987 01-Apr-2014 mav

MFC r263118:
Report ZVOL block size as GEOM stripesize.


# 263397 19-Mar-2014 delphij

MFC r260150: MFV r259170:

4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47bfcd2ad9bf501faec8e75c08095e4f

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.


# 263390 19-Mar-2014 delphij

MFC r259813 + r259813: MFV r258374:

4171 clean up spa_feature_*() interfaces

4172 implement extensible_dataset feature for use by other zpool
features

illumos/illumos-gate@2acef22db7808606888f8f92715629ff3ba555b9


# 260385 06-Jan-2014 scottl

MFC Alexander Motin's GEOM direct dispatch work:

r256603:
Introduce new function devstat_end_transaction_bio_bt(), adding new argument
to specify present time. Use this function to move binuptime() out of lock,
substantially reducing lock congestion when slow timecounter is used.

r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.

r256607:
Fix passing uninitialized bio_resid argument to g_trace().

r256610:
Add unmapped I/O support to GEOM RAID.

r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodone() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).

r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, this avoids passing I/O requests
to the GEOM g_up/g_down threads, executing them directly in the caller context.
That avoids CPU bottlenecks in the g_up/g_down threads and saves
several context switches per I/O.

r259247:
Fix bug introduced at r256607. We have to recalculate bp_resid here since
sizes of original and completed requests may differ due to end of media.

Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.

Submitted by: mav
Sponsored by: iX Systems

