NEX-20218 Backport Illumos #9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
MFV illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
    9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Andriy Gapon <avg@FreeBSD.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-20208 Backport Illumos #9993 zil writes can get delayed in zio pipeline
MFV illumos-gate@2258ad0b755b24a55c6173b1e6bb6188389f72dd
    9993 zil writes can get delayed in zio pipeline
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15067 KRRP: system panics during ZFS-receive: assertion failed: arc_can_share(hdr, buf)
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8065 ZFS doesn't notice when disk vdevs have no write cache
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
NEX-5856 ddt_capped isn't reset when deduped dataset is destroyed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5188 Removed special-vdev causes panic on read or on get size of special-bp
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4794 Write Back Cache sync and async writes: adjust routing according to watermark limits
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4619 Want kstats to monitor TRIM and UNMAP operation
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-4003 WRC: System panics on debug build
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3411 Removal of small l2arc ddt vdev disables dedup despite enough RAM
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-1110 Odd zpool Latency Output
OS-70 remove zio timer code
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line), place each kmutex_t on cache line by itself to avoid false sharing
Fixup merge results
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
SUP-504 Multiple disks being falsely failed/retired by new zio_timeout handling code
re #12770 rb4121 zio latency reports can produce false positives
re #12645 rb4073 Make vdev delay simulator independent of DEBUG
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #12616 rb4051 zfs_log_write()/dmu_sync() write once to special refactoring
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

*** 16,30 ****
   * fields enclosed by brackets "[]" replaced with your own identifying
   * information: Portions Copyright [yyyy] [name of copyright owner]
   *
   * CDDL HEADER END
   */
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
-  * Copyright (c) 2011 Nexenta Systems, Inc. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
   */
  
  #include <sys/sysmacros.h>
  #include <sys/zfs_context.h>
  #include <sys/fm/fs/zfs.h>
--- 16,31 ----
   * fields enclosed by brackets "[]" replaced with your own identifying
   * information: Portions Copyright [yyyy] [name of copyright owner]
   *
   * CDDL HEADER END
   */
+ 
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
+  * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
   */
  
  #include <sys/sysmacros.h>
  #include <sys/zfs_context.h>
  #include <sys/fm/fs/zfs.h>
*** 37,50 ****
--- 38,58 ----
  #include <sys/zio_checksum.h>
  #include <sys/dmu_objset.h>
  #include <sys/arc.h>
  #include <sys/ddt.h>
  #include <sys/blkptr.h>
+ #include <sys/special.h>
+ #include <sys/blkptr.h>
  #include <sys/zfeature.h>
+ #include <sys/dkioc_free_util.h>
+ #include <sys/dsl_scan.h>
+ #include <sys/metaslab_impl.h>
  #include <sys/abd.h>
  
+ extern int zfs_txg_timeout;
+ 
  /*
   * ==========================================================================
   * I/O type descriptions
   * ==========================================================================
   */
*** 67,82 ****
  #ifdef _KERNEL
  extern vmem_t *zio_alloc_arena;
  #endif
  
- #define ZIO_PIPELINE_CONTINUE		0x100
- #define ZIO_PIPELINE_STOP		0x101
- 
  #define BP_SPANB(indblkshift, level) \
      (((uint64_t)1) << ((level) * ((indblkshift) - SPA_BLKPTRSHIFT)))
  #define COMPARE_META_LEVEL	0x80000000ul
  /*
   * The following actions directly effect the spa's sync-to-convergence logic.
   * The values below define the sync pass when we start performing the action.
   * Care should be taken when changing these values as they directly impact
   * spa_sync() performance. Tuning these values may introduce subtle performance
--- 75,88 ----
  #ifdef _KERNEL
  extern vmem_t *zio_alloc_arena;
  #endif
  
  #define BP_SPANB(indblkshift, level) \
      (((uint64_t)1) << ((level) * ((indblkshift) - SPA_BLKPTRSHIFT)))
  #define COMPARE_META_LEVEL	0x80000000ul
+ 
  /*
   * The following actions directly effect the spa's sync-to-convergence logic.
   * The values below define the sync pass when we start performing the action.
   * Care should be taken when changing these values as they directly impact
   * spa_sync() performance. Tuning these values may introduce subtle performance
*** 103,112 ****
--- 109,133 ----
  int zio_buf_debug_limit = 16384;
  #else
  int zio_buf_debug_limit = 0;
  #endif
  
+ /*
+  * Fault insertion for stress testing
+  */
+ int zio_faulty_vdev_enabled = 0;
+ uint64_t zio_faulty_vdev_guid;
+ uint64_t zio_faulty_vdev_delay_us = 1000000;	/* 1 second */
+ 
+ /*
+  * Tunable to allow for debugging SCSI UNMAP/SATA TRIM calls. Disabling
+  * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
+  * underlying storage.
+  */
+ boolean_t zfs_trim = B_TRUE;
+ uint64_t zfs_trim_min_ext_sz = 1 << 20;	/* 1 MB */
+ 
  static void zio_taskq_dispatch(zio_t *, zio_taskq_type_t, boolean_t);
  
  void
  zio_init(void)
  {
*** 178,187 ****
--- 199,209 ----
  		if (zio_data_buf_cache[c - 1] == NULL)
  			zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
  	}
  
  	zio_inject_init();
+ 
  }
  
  void
  zio_fini(void)
  {
*** 440,469 ****
  	kmem_cache_free(zio_link_cache, zl);
  }
  
  static boolean_t
! zio_wait_for_children(zio_t *zio, uint8_t childbits, enum zio_wait_type wait)
  {
  	boolean_t waiting = B_FALSE;
  
  	mutex_enter(&zio->io_lock);
  	ASSERT(zio->io_stall == NULL);
- 	for (int c = 0; c < ZIO_CHILD_TYPES; c++) {
- 		if (!(ZIO_CHILD_BIT_IS_SET(childbits, c)))
- 			continue;
- 
- 		uint64_t *countp = &zio->io_children[c][wait];
  		if (*countp != 0) {
  			zio->io_stage >>= 1;
  			ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
  			zio->io_stall = countp;
  			waiting = B_TRUE;
- 			break;
  		}
- 	}
  	mutex_exit(&zio->io_lock);
  
  	return (waiting);
  }
  
  static void
  zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
--- 462,486 ----
  	kmem_cache_free(zio_link_cache, zl);
  }
  
  static boolean_t
! zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
  {
+ 	uint64_t *countp = &zio->io_children[child][wait];
  	boolean_t waiting = B_FALSE;
  
  	mutex_enter(&zio->io_lock);
  	ASSERT(zio->io_stall == NULL);
  	if (*countp != 0) {
  		zio->io_stage >>= 1;
  		ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
  		zio->io_stall = countp;
  		waiting = B_TRUE;
  	}
  	mutex_exit(&zio->io_lock);
+ 
  	return (waiting);
  }
  
  static void
  zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
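With this change a pipeline stage waits on one child type per call instead of passing a bit mask. As a minimal sketch of the new calling convention (taken from the reworked write stage later in this diff), a stage that must wait on two child types simply makes two calls:

    if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
        zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
            return (ZIO_PIPELINE_STOP);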
*** 617,631 ****
--- 634,653 ----
  	if (zb != NULL)
  		zio->io_bookmark = *zb;
  
  	if (pio != NULL) {
+ 		zio->io_mc = pio->io_mc;
  		if (zio->io_logical == NULL)
  			zio->io_logical = pio->io_logical;
  		if (zio->io_child_type == ZIO_CHILD_GANG)
  			zio->io_gang_leader = pio->io_gang_leader;
  		zio_add_child(pio, zio);
+ 
+ 		/* copy the smartcomp setting when creating child zio's */
+ 		bcopy(&pio->io_smartcomp, &zio->io_smartcomp,
+ 		    sizeof (zio->io_smartcomp));
  	}
  
  	return (zio);
  }
*** 660,669 ****
--- 682,699 ----
  }
  
  void
  zfs_blkptr_verify(spa_t *spa, const blkptr_t *bp)
  {
+ 	/*
+ 	 * SPECIAL-BP has two DVAs, but DVA[0] in this case is a
+ 	 * temporary DVA, and after migration only the DVA[1]
+ 	 * contains valid data. Therefore, we start walking for
+ 	 * these BPs from DVA[1].
+ 	 */
+ 	int start_dva = BP_IS_SPECIAL(bp) ? 1 : 0;
+ 
  	if (!DMU_OT_IS_VALID(BP_GET_TYPE(bp))) {
  		zfs_panic_recover("blkptr at %p has invalid TYPE %llu",
  		    bp, (longlong_t)BP_GET_TYPE(bp));
  	}
  	if (BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
*** 691,715 ****
  		    bp, (longlong_t)BPE_GET_ETYPE(bp));
  	}
  }
  
  	/*
- 	 * Do not verify individual DVAs if the config is not trusted. This
- 	 * will be done once the zio is executed in vdev_mirror_map_alloc.
- 	 */
- 	if (!spa->spa_trust_config)
- 		return;
- 
- 	/*
  	 * Pool-specific checks.
  	 *
  	 * Note: it would be nice to verify that the blk_birth and
  	 * BP_PHYSICAL_BIRTH() are not too large. However, spa_freeze()
  	 * allows the birth time of log blocks (and dmu_sync()-ed blocks
  	 * that are in the log) to be arbitrarily large.
  	 */
! 	for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
  		uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[i]);
  		if (vdevid >= spa->spa_root_vdev->vdev_children) {
  			zfs_panic_recover("blkptr at %p DVA %u has invalid "
  			    "VDEV %llu", bp, i, (longlong_t)vdevid);
--- 721,738 ----
  		    bp, (longlong_t)BPE_GET_ETYPE(bp));
  	}
  }
  
  	/*
  	 * Pool-specific checks.
  	 *
  	 * Note: it would be nice to verify that the blk_birth and
  	 * BP_PHYSICAL_BIRTH() are not too large. However, spa_freeze()
  	 * allows the birth time of log blocks (and dmu_sync()-ed blocks
  	 * that are in the log) to be arbitrarily large.
  	 */
! 	for (int i = start_dva; i < BP_GET_NDVAS(bp); i++) {
  		uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[i]);
  		if (vdevid >= spa->spa_root_vdev->vdev_children) {
  			zfs_panic_recover("blkptr at %p DVA %u has invalid "
  			    "VDEV %llu", bp, i, (longlong_t)vdevid);
*** 746,785 ****
  		    bp, i, (longlong_t)offset);
  		}
  	}
  }
  
- boolean_t
- zfs_dva_valid(spa_t *spa, const dva_t *dva, const blkptr_t *bp)
- {
- 	uint64_t vdevid = DVA_GET_VDEV(dva);
- 
- 	if (vdevid >= spa->spa_root_vdev->vdev_children)
- 		return (B_FALSE);
- 
- 	vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
- 	if (vd == NULL)
- 		return (B_FALSE);
- 
- 	if (vd->vdev_ops == &vdev_hole_ops)
- 		return (B_FALSE);
- 
- 	if (vd->vdev_ops == &vdev_missing_ops) {
- 		return (B_FALSE);
- 	}
- 
- 	uint64_t offset = DVA_GET_OFFSET(dva);
- 	uint64_t asize = DVA_GET_ASIZE(dva);
- 
- 	if (BP_IS_GANG(bp))
- 		asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
- 	if (offset + asize > vd->vdev_asize)
- 		return (B_FALSE);
- 
- 	return (B_TRUE);
- }
- 
  zio_t *
  zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, abd_t *data, uint64_t size,
      zio_done_func_t *done, void *private, zio_priority_t priority,
      enum zio_flag flags, const zbookmark_phys_t *zb)
  {
--- 769,778 ----
*** 800,810 ****
  zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
      abd_t *data, uint64_t lsize, uint64_t psize, const zio_prop_t *zp,
      zio_done_func_t *ready, zio_done_func_t *children_ready,
      zio_done_func_t *physdone, zio_done_func_t *done,
      void *private, zio_priority_t priority, enum zio_flag flags,
!     const zbookmark_phys_t *zb)
  {
  	zio_t *zio;
  
  	ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
  	    zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
--- 793,804 ----
  zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
      abd_t *data, uint64_t lsize, uint64_t psize, const zio_prop_t *zp,
      zio_done_func_t *ready, zio_done_func_t *children_ready,
      zio_done_func_t *physdone, zio_done_func_t *done,
      void *private, zio_priority_t priority, enum zio_flag flags,
!     const zbookmark_phys_t *zb,
!     const zio_smartcomp_info_t *smartcomp)
  {
  	zio_t *zio;
  
  	ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
  	    zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
*** 822,831 ****
--- 816,827 ----
  	zio->io_ready = ready;
  	zio->io_children_ready = children_ready;
  	zio->io_physdone = physdone;
  	zio->io_prop = *zp;
+ 	if (smartcomp != NULL)
+ 		bcopy(smartcomp, &zio->io_smartcomp, sizeof (*smartcomp));
  
  	/*
  	 * Data can be NULL if we are going to call zio_write_override() to
  	 * provide the already-allocated BP. But we may need the data to
  	 * verify a dedup hit (if requested). In this case, don't try to
*** 873,884 ****
  
  void
  zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp)
  {
- 	zfs_blkptr_verify(spa, bp);
- 
  	/*
  	 * The check for EMBEDDED is a performance optimization. We
  	 * process the free here (by ignoring it) rather than
  	 * putting it on the list and then processing it in zio_free_sync().
  	 */
--- 869,878 ----
*** 915,924 ****
--- 909,919 ----
  	if (BP_IS_EMBEDDED(bp))
  		return (zio_null(pio, spa, NULL, NULL, NULL, 0));
  
  	metaslab_check_free(spa, bp);
  	arc_freed(spa, bp);
+ 	dsl_scan_freed(spa, bp);
  
  	/*
  	 * GANG and DEDUP blocks can induce a read (for the gang block header,
  	 * or the DDT), so issue them asynchronously so that this thread is
  	 * not tied up.
*** 937,947 ****
  zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
      zio_done_func_t *done, void *private, enum zio_flag flags)
  {
  	zio_t *zio;
  
! 	zfs_blkptr_verify(spa, bp);
  
  	if (BP_IS_EMBEDDED(bp))
  		return (zio_null(pio, spa, NULL, NULL, NULL, 0));
  
  	/*
--- 932,942 ----
  zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
      zio_done_func_t *done, void *private, enum zio_flag flags)
  {
  	zio_t *zio;
  
! 	dprintf_bp(bp, "claiming in txg %llu", txg);
  
  	if (BP_IS_EMBEDDED(bp))
  		return (zio_null(pio, spa, NULL, NULL, NULL, 0));
  
  	/*
*** 966,1000 ****
  	ASSERT0(zio->io_queued_timestamp);
  
  	return (zio);
  }
  
! zio_t *
! zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
!     zio_done_func_t *done, void *private, enum zio_flag flags)
  {
  	zio_t *zio;
  	int c;
  
  	if (vd->vdev_children == 0) {
  		zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
  		    ZIO_TYPE_IOCTL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
! 		    ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
  
  		zio->io_cmd = cmd;
  	} else {
! 		zio = zio_null(pio, spa, NULL, NULL, NULL, flags);
! 
! 		for (c = 0; c < vd->vdev_children; c++)
! 			zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
! 			    done, private, flags));
  	}
  
  	return (zio);
  }
  
  zio_t *
  zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
      abd_t *data, int checksum, zio_done_func_t *done, void *private,
      zio_priority_t priority, enum zio_flag flags, boolean_t labels)
  {
  	zio_t *zio;
--- 961,1119 ----
  	ASSERT0(zio->io_queued_timestamp);
  
  	return (zio);
  }
  
! static zio_t *
! zio_ioctl_with_pipeline(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
!     zio_done_func_t *done, void *private, enum zio_flag flags,
!     enum zio_stage pipeline)
  {
  	zio_t *zio;
  	int c;
  
  	if (vd->vdev_children == 0) {
  		zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
  		    ZIO_TYPE_IOCTL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
! 		    ZIO_STAGE_OPEN, pipeline);
  
  		zio->io_cmd = cmd;
  	} else {
! 		zio = zio_null(pio, spa, vd, done, private, flags);
! 
! 		/*
! 		 * DKIOCFREE ioctl's need some special handling on interior
! 		 * vdevs. If the device provides an ops function to handle
! 		 * recomputing dkioc_free extents, then we call it.
! 		 * Otherwise the default behavior applies, which simply fans
! 		 * out the ioctl to all component vdevs.
! 		 */
! 		if (cmd == DKIOCFREE && vd->vdev_ops->vdev_op_trim != NULL) {
! 			vd->vdev_ops->vdev_op_trim(vd, zio, private);
! 		} else {
  			for (c = 0; c < vd->vdev_children; c++)
! 				zio_nowait(zio_ioctl_with_pipeline(zio,
! 				    spa, vd->vdev_child[c], cmd, NULL,
! 				    private, flags, pipeline));
  		}
+ 	}
  
  	return (zio);
  }
  
  zio_t *
+ zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
+     zio_done_func_t *done, void *private, enum zio_flag flags)
+ {
+ 	return (zio_ioctl_with_pipeline(pio, spa, vd, cmd, done,
+ 	    private, flags, ZIO_IOCTL_PIPELINE));
+ }
+ 
+ /*
+  * Callback for when a trim zio has completed. This simply frees the
+  * dkioc_free_list_t extent list of the DKIOCFREE ioctl.
+  */
+ static void
+ zio_trim_done(zio_t *zio)
+ {
+ 	VERIFY(zio->io_private != NULL);
+ 	dfl_free(zio->io_private);
+ }
+ 
+ static void
+ zio_trim_check(uint64_t start, uint64_t len, void *msp)
+ {
+ 	metaslab_t *ms = msp;
+ 	boolean_t held = MUTEX_HELD(&ms->ms_lock);
+ 	if (!held)
+ 		mutex_enter(&ms->ms_lock);
+ 	ASSERT(ms->ms_trimming_ts != NULL);
+ 	ASSERT(range_tree_contains(ms->ms_trimming_ts->ts_tree,
+ 	    start - VDEV_LABEL_START_SIZE, len));
+ 	if (!held)
+ 		mutex_exit(&ms->ms_lock);
+ }
+ 
+ /*
+  * Takes a bunch of freed extents and tells the underlying vdevs that the
+  * space associated with these extents can be released.
+  * This is used by flash storage to pre-erase blocks for rapid reuse later
+  * and thin-provisioned block storage to reclaim unused blocks.
+  */
+ zio_t *
+ zio_trim(spa_t *spa, vdev_t *vd, struct range_tree *tree,
+     zio_done_func_t *done, void *private, enum zio_flag flags,
+     int trim_flags, metaslab_t *msp)
+ {
+ 	dkioc_free_list_t *dfl = NULL;
+ 	range_seg_t *rs;
+ 	uint64_t rs_idx;
+ 	uint64_t num_exts;
+ 	uint64_t bytes_issued = 0, bytes_skipped = 0, exts_skipped = 0;
+ 	/*
+ 	 * We need this to invoke the caller's `done' callback with the
+ 	 * correct io_private (not the dkioc_free_list_t, which is needed
+ 	 * by the underlying DKIOCFREE ioctl).
+ 	 */
+ 	zio_t *sub_pio = zio_root(spa, done, private, flags);
+ 
+ 	ASSERT(range_tree_space(tree) != 0);
+ 
+ 	if (!zfs_trim)
+ 		return (sub_pio);
+ 
+ 	num_exts = avl_numnodes(&tree->rt_root);
+ 	dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
+ 	dfl->dfl_flags = trim_flags;
+ 	dfl->dfl_num_exts = num_exts;
+ 	dfl->dfl_offset = VDEV_LABEL_START_SIZE;
+ 	if (msp) {
+ 		dfl->dfl_ck_func = zio_trim_check;
+ 		dfl->dfl_ck_arg = msp;
+ 	}
+ 
+ 	for (rs = avl_first(&tree->rt_root), rs_idx = 0; rs != NULL;
+ 	    rs = AVL_NEXT(&tree->rt_root, rs)) {
+ 		uint64_t len = rs->rs_end - rs->rs_start;
+ 
+ 		if (len < zfs_trim_min_ext_sz) {
+ 			bytes_skipped += len;
+ 			exts_skipped++;
+ 			continue;
+ 		}
+ 
+ 		dfl->dfl_exts[rs_idx].dfle_start = rs->rs_start;
+ 		dfl->dfl_exts[rs_idx].dfle_length = len;
+ 
+ 		/* check we're a multiple of the vdev ashift */
+ 		ASSERT0(dfl->dfl_exts[rs_idx].dfle_start &
+ 		    ((1 << vd->vdev_ashift) - 1));
+ 		ASSERT0(dfl->dfl_exts[rs_idx].dfle_length &
+ 		    ((1 << vd->vdev_ashift) - 1));
+ 
+ 		rs_idx++;
+ 		bytes_issued += len;
+ 	}
+ 
+ 	spa_trimstats_update(spa, rs_idx, bytes_issued, exts_skipped,
+ 	    bytes_skipped);
+ 
+ 	/* the zfs_trim_min_ext_sz filter may have shortened the list */
+ 	if (dfl->dfl_num_exts != rs_idx) {
+ 		dkioc_free_list_t *dfl2 = kmem_zalloc(DFL_SZ(rs_idx), KM_SLEEP);
+ 		bcopy(dfl, dfl2, DFL_SZ(rs_idx));
+ 		dfl2->dfl_num_exts = rs_idx;
+ 		dfl_free(dfl);
+ 		dfl = dfl2;
+ 	}
+ 
+ 	zio_nowait(zio_ioctl_with_pipeline(sub_pio, spa, vd, DKIOCFREE,
+ 	    zio_trim_done, dfl, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
+ 	    ZIO_FLAG_DONT_RETRY, ZIO_TRIM_PIPELINE));
+ 	return (sub_pio);
+ }
+ 
+ zio_t *
  zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
      abd_t *data, int checksum, zio_done_func_t *done, void *private,
      zio_priority_t priority, enum zio_flag flags, boolean_t labels)
  {
  	zio_t *zio;
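For context, a caller hands zio_trim() a range tree of freed extents and waits on the returned root zio; zio_trim_done() then frees the extent list. A minimal caller sketch, assuming an already-populated range_tree_t *tree for top-level vdev vd (the real callers live in the metaslab TRIM code; the names below are illustrative, not from this patch):

    /* Illustrative only: issue DKIOCFREE for all extents in 'tree'. */
    zio_t *tzio = zio_trim(spa, vd, tree, NULL, NULL,
        ZIO_FLAG_CANFAIL, 0, NULL);    /* no done callback, no metaslab check */
    (void) zio_wait(tzio);             /* waits for all DKIOCFREE children */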
*** 1056,1086 ****
      enum zio_flag flags, zio_done_func_t *done, void *private)
  {
  	enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;
  	zio_t *zio;
  
! 	/*
! 	 * vdev child I/Os do not propagate their error to the parent.
! 	 * Therefore, for correct operation the caller *must* check for
! 	 * and handle the error in the child i/o's done callback.
! 	 * The only exceptions are i/os that we don't care about
! 	 * (OPTIONAL or REPAIR).
! 	 */
! 	ASSERT((flags & ZIO_FLAG_OPTIONAL) || (flags & ZIO_FLAG_IO_REPAIR) ||
! 	    done != NULL);
  
- 	/*
- 	 * In the common case, where the parent zio was to a normal vdev,
- 	 * the child zio must be to a child vdev of that vdev.  Otherwise,
- 	 * the child zio must be to a top-level vdev.
- 	 */
- 	if (pio->io_vd != NULL && pio->io_vd->vdev_ops != &vdev_indirect_ops) {
- 		ASSERT3P(vd->vdev_parent, ==, pio->io_vd);
- 	} else {
- 		ASSERT3P(vd, ==, vd->vdev_top);
- 	}
- 
  	if (type == ZIO_TYPE_READ && bp != NULL) {
  		/*
  		 * If we have the bp, then the child should perform the
  		 * checksum and the parent need not.  This pushes error
  		 * detection as close to the leaves as possible and
--- 1175,1187 ----
      enum zio_flag flags, zio_done_func_t *done, void *private)
  {
  	enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;
  	zio_t *zio;
  
! 	ASSERT(vd->vdev_parent ==
! 	    (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
  
  	if (type == ZIO_TYPE_READ && bp != NULL) {
  		/*
  		 * If we have the bp, then the child should perform the
  		 * checksum and the parent need not.  This pushes error
  		 * detection as close to the leaves as possible and
*** 1088,1103 ****
  	 */
  		pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
  		pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
  	}
  
! 	if (vd->vdev_ops->vdev_op_leaf) {
! 		ASSERT0(vd->vdev_children);
  		offset += VDEV_LABEL_START_SIZE;
- 	}
  
! 	flags |= ZIO_VDEV_CHILD_FLAGS(pio);
  
  	/*
  	 * If we've decided to do a repair, the write is not speculative --
  	 * even if the original read was.
  	 */
--- 1189,1202 ----
  	 */
  		pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
  		pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
  	}
  
! 	if (vd->vdev_children == 0)
  		offset += VDEV_LABEL_START_SIZE;
  
! 	flags |= ZIO_VDEV_CHILD_FLAGS(pio) | ZIO_FLAG_DONT_PROPAGATE;
  
  	/*
  	 * If we've decided to do a repair, the write is not speculative --
  	 * even if the original read was.
  	 */
*** 1110,1120 ****
  	 * If this is a retried I/O then we ignore it since we will
  	 * have already processed the original allocating I/O.
  	 */
  	if (flags & ZIO_FLAG_IO_ALLOCATING &&
  	    (vd != vd->vdev_top || (flags & ZIO_FLAG_IO_RETRY))) {
! 		metaslab_class_t *mc = spa_normal_class(pio->io_spa);
  
  		ASSERT(mc->mc_alloc_throttle_enabled);
  		ASSERT(type == ZIO_TYPE_WRITE);
  		ASSERT(priority == ZIO_PRIORITY_ASYNC_WRITE);
  		ASSERT(!(flags & ZIO_FLAG_IO_REPAIR));
--- 1209,1219 ----
  	 * If this is a retried I/O then we ignore it since we will
  	 * have already processed the original allocating I/O.
  	 */
  	if (flags & ZIO_FLAG_IO_ALLOCATING &&
  	    (vd != vd->vdev_top || (flags & ZIO_FLAG_IO_RETRY))) {
! 		metaslab_class_t *mc = pio->io_mc;
  
  		ASSERT(mc->mc_alloc_throttle_enabled);
  		ASSERT(type == ZIO_TYPE_WRITE);
  		ASSERT(priority == ZIO_PRIORITY_ASYNC_WRITE);
  		ASSERT(!(flags & ZIO_FLAG_IO_REPAIR));
*** 1191,1202 ****
  
  static int
  zio_read_bp_init(zio_t *zio)
  {
  	blkptr_t *bp = zio->io_bp;
  
- 	ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
- 
  	if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF &&
  	    zio->io_child_type == ZIO_CHILD_LOGICAL &&
  	    !(zio->io_flags & ZIO_FLAG_RAW)) {
  		uint64_t psize = BP_IS_EMBEDDED(bp) ?
  		    BPE_GET_PSIZE(bp) : BP_GET_PSIZE(bp);
--- 1290,1299 ----
*** 1211,1224 ****
  		void *data = abd_borrow_buf(zio->io_abd, psize);
  		decode_embedded_bp_compressed(bp, data);
  		abd_return_buf_copy(zio->io_abd, data, psize);
  	} else {
  		ASSERT(!BP_IS_EMBEDDED(bp));
- 		ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
  	}
  
! 	if (!DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) && BP_GET_LEVEL(bp) == 0)
  		zio->io_flags |= ZIO_FLAG_DONT_CACHE;
  
  	if (BP_GET_TYPE(bp) == DMU_OT_DDT_ZAP)
  		zio->io_flags |= ZIO_FLAG_DONT_CACHE;
--- 1308,1320 ----
  		void *data = abd_borrow_buf(zio->io_abd, psize);
  		decode_embedded_bp_compressed(bp, data);
  		abd_return_buf_copy(zio->io_abd, data, psize);
  	} else {
  		ASSERT(!BP_IS_EMBEDDED(bp));
  	}
  
! 	if (!BP_IS_METADATA(bp))
  		zio->io_flags |= ZIO_FLAG_DONT_CACHE;
  
  	if (BP_GET_TYPE(bp) == DMU_OT_DDT_ZAP)
  		zio->io_flags |= ZIO_FLAG_DONT_CACHE;
*** 1302,1315 ****
  	/*
  	 * If our children haven't all reached the ready stage,
  	 * wait for them and then repeat this pipeline stage.
  	 */
! 	if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
! 	    ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
  		return (ZIO_PIPELINE_STOP);
- 	}
  
  	if (!IO_IS_ALLOCATING(zio))
  		return (ZIO_PIPELINE_CONTINUE);
  
  	if (zio->io_children_ready != NULL) {
--- 1398,1410 ----
  	/*
  	 * If our children haven't all reached the ready stage,
  	 * wait for them and then repeat this pipeline stage.
  	 */
! 	if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
! 	    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
  		return (ZIO_PIPELINE_STOP);
  
  	if (!IO_IS_ALLOCATING(zio))
  		return (ZIO_PIPELINE_CONTINUE);
  
  	if (zio->io_children_ready != NULL) {
*** 1347,1358 ****
  		/* Make sure someone doesn't change their mind on overwrites */
  		ASSERT(BP_IS_EMBEDDED(bp) || MIN(zp->zp_copies + BP_IS_GANG(bp),
  		    spa_max_replication(spa)) == BP_GET_NDVAS(bp));
  	}
  
  	/* If it's a compressed write that is not raw, compress the buffer. */
! 	if (compress != ZIO_COMPRESS_OFF && psize == lsize) {
  		void *cbuf = zio_buf_alloc(lsize);
  		psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
  		if (psize == 0 || psize == lsize) {
  			compress = ZIO_COMPRESS_OFF;
  			zio_buf_free(cbuf, lsize);
--- 1442,1455 ----
  		/* Make sure someone doesn't change their mind on overwrites */
  		ASSERT(BP_IS_EMBEDDED(bp) || MIN(zp->zp_copies + BP_IS_GANG(bp),
  		    spa_max_replication(spa)) == BP_GET_NDVAS(bp));
  	}
  
+ 	DTRACE_PROBE1(zio_compress_ready, zio_t *, zio);
+ 
  	/* If it's a compressed write that is not raw, compress the buffer. */
! 	if (compress != ZIO_COMPRESS_OFF && psize == lsize &&
! 	    ZIO_SHOULD_COMPRESS(zio)) {
  		void *cbuf = zio_buf_alloc(lsize);
  		psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
  		if (psize == 0 || psize == lsize) {
  			compress = ZIO_COMPRESS_OFF;
  			zio_buf_free(cbuf, lsize);
*** 1367,1376 ****
--- 1464,1479 ----
  			zio_buf_free(cbuf, lsize);
  			bp->blk_birth = zio->io_txg;
  			zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
  			ASSERT(spa_feature_is_active(spa, SPA_FEATURE_EMBEDDED_DATA));
+ 			if (zio->io_smartcomp.sc_result != NULL) {
+ 				zio->io_smartcomp.sc_result(
+ 				    zio->io_smartcomp.sc_userinfo, zio);
+ 			} else {
+ 				ASSERT(zio->io_smartcomp.sc_ask == NULL);
+ 			}
  			return (ZIO_PIPELINE_CONTINUE);
  		} else {
  			/*
  			 * Round up compressed size up to the ashift
  			 * of the smallest-ashift device, and zero the tail.
*** 1394,1412 ****
--- 1497,1533 ----
  				zio_push_transform(zio, cdata,
  				    psize, lsize, NULL);
  			}
  		}
  
+ 		if (zio->io_smartcomp.sc_result != NULL) {
+ 			zio->io_smartcomp.sc_result(
+ 			    zio->io_smartcomp.sc_userinfo, zio);
+ 		} else {
+ 			ASSERT(zio->io_smartcomp.sc_ask == NULL);
+ 		}
+ 
  		/*
  		 * We were unable to handle this as an override bp, treat
  		 * it as a regular write I/O.
  		 */
  		zio->io_bp_override = NULL;
  		*bp = zio->io_bp_orig;
  		zio->io_pipeline = zio->io_orig_pipeline;
  	} else {
  		ASSERT3U(psize, !=, 0);
+ 
+ 		/*
+ 		 * We are here because of:
+ 		 *  - compress == ZIO_COMPRESS_OFF
+ 		 *  - SmartCompression decides don't compress this data
+ 		 *  - this is a RAW-write
+ 		 *
+ 		 * In case of RAW-write we should not override "compress"
+ 		 */
+ 		if ((zio->io_flags & ZIO_FLAG_RAW) == 0)
+ 			compress = ZIO_COMPRESS_OFF;
  	}
  
  	/*
  	 * The final pass of spa_sync() must be all rewrites, but the first
  	 * few passes offer a trade-off: allocating blocks defers convergence,
*** 1435,1444 ****
--- 1556,1569 ----
  			BP_SET_LEVEL(bp, zp->zp_level);
  			BP_SET_BIRTH(bp, zio->io_txg, 0);
  		}
  		zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
  	} else {
+ 		if (zp->zp_dedup) {
+ 			/* check the best-effort dedup setting */
+ 			zio_best_effort_dedup(zio);
+ 		}
  		ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
  		BP_SET_LSIZE(bp, lsize);
  		BP_SET_TYPE(bp, zp->zp_type);
  		BP_SET_LEVEL(bp, zp->zp_level);
  		BP_SET_PSIZE(bp, psize);
*** 1468,1479 ****
  	if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
  		if (BP_GET_DEDUP(bp))
  			zio->io_pipeline = ZIO_DDT_FREE_PIPELINE;
  	}
  
- 	ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
- 
  	return (ZIO_PIPELINE_CONTINUE);
  }
  
  /*
   * ==========================================================================
--- 1593,1602 ----
*** 1504,1514 ****
  	/*
  	 * If this is a high priority I/O, then use the high priority taskq if
  	 * available.
  	 */
! 	if (zio->io_priority == ZIO_PRIORITY_NOW &&
  	    spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
  		q++;
  
  	ASSERT3U(q, <, ZIO_TASKQ_TYPES);
--- 1627,1638 ----
  	/*
  	 * If this is a high priority I/O, then use the high priority taskq if
  	 * available.
  	 */
! 	if ((zio->io_priority == ZIO_PRIORITY_NOW ||
! 	    zio->io_priority == ZIO_PRIORITY_SYNC_WRITE) &&
  	    spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
  		q++;
  
  	ASSERT3U(q, <, ZIO_TASKQ_TYPES);
*** 1631,1640 ****
--- 1755,1765 ----
  	ASSERT3U(zio->io_queued_timestamp, >, 0);
  
  	while (zio->io_stage < ZIO_STAGE_DONE) {
  		enum zio_stage pipeline = zio->io_pipeline;
+ 		enum zio_stage old_stage = zio->io_stage;
  		enum zio_stage stage = zio->io_stage;
  		int rv;
  
  		ASSERT(!MUTEX_HELD(&zio->io_lock));
  		ASSERT(ISP2(stage));
*** 1668,1677 ****
--- 1793,1808 ----
  		rv = zio_pipeline[highbit64(stage) - 1](zio);
  
  		if (rv == ZIO_PIPELINE_STOP)
  			return;
  
+ 		if (rv == ZIO_PIPELINE_RESTART_STAGE) {
+ 			zio->io_stage = old_stage;
+ 			(void) zio_issue_async(zio);
+ 			return;
+ 		}
+ 
  		ASSERT(rv == ZIO_PIPELINE_CONTINUE);
  	}
  }
  
  /*
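The new ZIO_PIPELINE_RESTART_STAGE return code lets a stage re-queue itself through a taskq instead of blocking: __zio_execute() rewinds io_stage and reissues the zio asynchronously. A hypothetical stage fragment illustrating the intended use (the condition name is illustrative, not from this patch):

    static int
    zio_example_stage(zio_t *zio)
    {
            if (need_to_back_off)   /* hypothetical condition */
                    return (ZIO_PIPELINE_RESTART_STAGE);
            return (ZIO_PIPELINE_CONTINUE);
    }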
*** 2148,2160 ****
  static int
  zio_gang_issue(zio_t *zio)
  {
  	blkptr_t *bp = zio->io_bp;
  
! 	if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT, ZIO_WAIT_DONE)) {
  		return (ZIO_PIPELINE_STOP);
- 	}
  
  	ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
  	ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
  
  	if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
--- 2279,2290 ----
  static int
  zio_gang_issue(zio_t *zio)
  {
  	blkptr_t *bp = zio->io_bp;
  
! 	if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
  		return (ZIO_PIPELINE_STOP);
  
  	ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
  	ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
  
  	if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
*** 2206,2216 ****
  static int
  zio_write_gang_block(zio_t *pio)
  {
  	spa_t *spa = pio->io_spa;
! 	metaslab_class_t *mc = spa_normal_class(spa);
  	blkptr_t *bp = pio->io_bp;
  	zio_t *gio = pio->io_gang_leader;
  	zio_t *zio;
  	zio_gang_node_t *gn, **gnpp;
  	zio_gbh_phys_t *gbh;
--- 2336,2346 ----
  static int
  zio_write_gang_block(zio_t *pio)
  {
  	spa_t *spa = pio->io_spa;
! 	metaslab_class_t *mc = pio->io_mc;
  	blkptr_t *bp = pio->io_bp;
  	zio_t *gio = pio->io_gang_leader;
  	zio_t *zio;
  	zio_gang_node_t *gn, **gnpp;
  	zio_gbh_phys_t *gbh;
*** 2303,2314 ****
  		zio_t *cio = zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
  		    abd_get_offset(pio->io_abd, pio->io_size - resid), lsize,
  		    lsize, &zp, zio_write_gang_member_ready, NULL, NULL,
  		    zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
! 		    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
  
  		if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
  			ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
  			ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
  
  			/*
--- 2433,2447 ----
  		zio_t *cio = zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
  		    abd_get_offset(pio->io_abd, pio->io_size - resid), lsize,
  		    lsize, &zp, zio_write_gang_member_ready, NULL, NULL,
  		    zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
! 		    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark,
! 		    &pio->io_smartcomp);
  
+ 		cio->io_mc = mc;
+ 
  		if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
  			ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
  			ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
  
  			/*
*** 2471,2483 ****
  static int
  zio_ddt_read_done(zio_t *zio)
  {
  	blkptr_t *bp = zio->io_bp;
  
! 	if (zio_wait_for_children(zio, ZIO_CHILD_DDT_BIT, ZIO_WAIT_DONE)) {
  		return (ZIO_PIPELINE_STOP);
- 	}
  
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
  	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
--- 2604,2615 ----
  static int
  zio_ddt_read_done(zio_t *zio)
  {
  	blkptr_t *bp = zio->io_bp;
  
! 	if (zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE))
  		return (ZIO_PIPELINE_STOP);
  
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
  	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
*** 2505,2514 ****
--- 2637,2647 ----
  		ASSERT(zio->io_vsd == NULL);
  
  	return (ZIO_PIPELINE_CONTINUE);
  }
  
+ /* ARGSUSED */
  static boolean_t
  zio_ddt_collision(zio_t *zio, ddt_t *ddt, ddt_entry_t *dde)
  {
  	spa_t *spa = zio->io_spa;
  	boolean_t do_raw = (zio->io_flags & ZIO_FLAG_RAW);
*** 2542,2552 ****
  			blkptr_t blk = *zio->io_bp;
  			int error;
  
  			ddt_bp_fill(ddp, &blk, ddp->ddp_phys_birth);
  
! 			ddt_exit(ddt);
  
  			/*
  			 * Intuitively, it would make more sense to compare
  			 * io_abd than io_orig_abd in the raw case since you
  			 * don't want to look at any transformations that have
--- 2675,2685 ----
  			blkptr_t blk = *zio->io_bp;
  			int error;
  
  			ddt_bp_fill(ddp, &blk, ddp->ddp_phys_birth);
  
! 			dde_exit(dde);
  
  			/*
  			 * Intuitively, it would make more sense to compare
  			 * io_abd than io_orig_abd in the raw case since you
  			 * don't want to look at any transformations that have
*** 2573,2583 ****
  			    zio->io_orig_size) != 0)
  				error = SET_ERROR(EEXIST);
  			arc_buf_destroy(abuf, &abuf);
  		}
  
! 		ddt_enter(ddt);
  		return (error != 0);
  	}
  
  	return (B_FALSE);
--- 2706,2716 ----
  			    zio->io_orig_size) != 0)
  				error = SET_ERROR(EEXIST);
  			arc_buf_destroy(abuf, &abuf);
  		}
  
! 		dde_enter(dde);
  		return (error != 0);
  	}
  
  	return (B_FALSE);
*** 2585,2624 ****
  
  static void
  zio_ddt_child_write_ready(zio_t *zio)
  {
  	int p = zio->io_prop.zp_copies;
- 	ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  	zio_t *pio;
  
  	if (zio->io_error)
  		return;
  
! 	ddt_enter(ddt);
  
  	ASSERT(dde->dde_lead_zio[p] == zio);
  
  	ddt_phys_fill(ddp, zio->io_bp);
  
  	zio_link_t *zl = NULL;
  	while ((pio = zio_walk_parents(zio, &zl)) != NULL)
  		ddt_bp_fill(ddp, pio->io_bp, zio->io_txg);
  
! 	ddt_exit(ddt);
  }
  
  static void
  zio_ddt_child_write_done(zio_t *zio)
  {
  	int p = zio->io_prop.zp_copies;
- 	ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  
! 	ddt_enter(ddt);
  
  	ASSERT(ddp->ddp_refcnt == 0);
  	ASSERT(dde->dde_lead_zio[p] == zio);
  	dde->dde_lead_zio[p] = NULL;
--- 2718,2755 ----
  
  static void
  zio_ddt_child_write_ready(zio_t *zio)
  {
  	int p = zio->io_prop.zp_copies;
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  	zio_t *pio;
  
  	if (zio->io_error)
  		return;
  
! 	dde_enter(dde);
  
  	ASSERT(dde->dde_lead_zio[p] == zio);
  
  	ddt_phys_fill(ddp, zio->io_bp);
  
  	zio_link_t *zl = NULL;
  	while ((pio = zio_walk_parents(zio, &zl)) != NULL)
  		ddt_bp_fill(ddp, pio->io_bp, zio->io_txg);
  
! 	dde_exit(dde);
  }
  
  static void
  zio_ddt_child_write_done(zio_t *zio)
  {
  	int p = zio->io_prop.zp_copies;
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  
! 	dde_enter(dde);
  
  	ASSERT(ddp->ddp_refcnt == 0);
  	ASSERT(dde->dde_lead_zio[p] == zio);
  	dde->dde_lead_zio[p] = NULL;
*** 2628,2638 ****
  		ddt_phys_addref(ddp);
  	} else {
  		ddt_phys_clear(ddp);
  	}
  
! 	ddt_exit(ddt);
  }
  
  static void
  zio_ddt_ditto_write_done(zio_t *zio)
  {
--- 2759,2769 ----
  		ddt_phys_addref(ddp);
  	} else {
  		ddt_phys_clear(ddp);
  	}
  
! 	dde_exit(dde);
  }
  
  static void
  zio_ddt_ditto_write_done(zio_t *zio)
  {
*** 2642,2652 ****
  	ddt_t *ddt = ddt_select(zio->io_spa, bp);
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  	ddt_key_t *ddk = &dde->dde_key;
  
! 	ddt_enter(ddt);
  
  	ASSERT(ddp->ddp_refcnt == 0);
  	ASSERT(dde->dde_lead_zio[p] == zio);
  	dde->dde_lead_zio[p] = NULL;
--- 2773,2783 ----
  	ddt_t *ddt = ddt_select(zio->io_spa, bp);
  	ddt_entry_t *dde = zio->io_private;
  	ddt_phys_t *ddp = &dde->dde_phys[p];
  	ddt_key_t *ddk = &dde->dde_key;
  
! 	dde_enter(dde);
  
  	ASSERT(ddp->ddp_refcnt == 0);
  	ASSERT(dde->dde_lead_zio[p] == zio);
  	dde->dde_lead_zio[p] = NULL;
*** 2657,2667 ****
  		if (ddp->ddp_phys_birth != 0)
  			ddt_phys_free(ddt, ddk, ddp, zio->io_txg);
  		ddt_phys_fill(ddp, bp);
  	}
  
! 	ddt_exit(ddt);
  }
  
  static int
  zio_ddt_write(zio_t *zio)
  {
--- 2788,2798 ----
  		if (ddp->ddp_phys_birth != 0)
  			ddt_phys_free(ddt, ddk, ddp, zio->io_txg);
  		ddt_phys_fill(ddp, bp);
  	}
  
! 	dde_exit(dde);
  }
  
  static int
  zio_ddt_write(zio_t *zio)
  {
*** 2680,2693 ****
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(BP_GET_CHECKSUM(bp) == zp->zp_checksum);
  	ASSERT(BP_IS_HOLE(bp) || zio->io_bp_override);
  	ASSERT(!(zio->io_bp_override && (zio->io_flags & ZIO_FLAG_RAW)));
  
- 	ddt_enter(ddt);
  	dde = ddt_lookup(ddt, bp, B_TRUE);
- 	ddp = &dde->dde_phys[p];
  
  	if (zp->zp_dedup_verify && zio_ddt_collision(zio, ddt, dde)) {
  		/*
  		 * If we're using a weak checksum, upgrade to a strong checksum
  		 * and try again.  If we're already using a strong checksum,
  		 * we can't resolve it, so just convert to an ordinary write.
--- 2811,2846 ----
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(BP_GET_CHECKSUM(bp) == zp->zp_checksum);
  	ASSERT(BP_IS_HOLE(bp) || zio->io_bp_override);
  	ASSERT(!(zio->io_bp_override && (zio->io_flags & ZIO_FLAG_RAW)));
  
  	dde = ddt_lookup(ddt, bp, B_TRUE);
  
+ 	/*
+ 	 * If we're not using special tier, for each new DDE that's not on disk:
+ 	 * disable dedup if we have exhausted "allowed" DDT L2/ARC space
+ 	 */
+ 	if ((dde->dde_state & DDE_NEW) && !spa->spa_usesc &&
+ 	    (zfs_ddt_limit_type != DDT_NO_LIMIT || zfs_ddt_byte_ceiling != 0)) {
+ 		/* turn off dedup if we need to stop DDT growth */
+ 		if (spa_enable_dedup_cap(spa)) {
+ 			dde->dde_state |= DDE_DONT_SYNC;
+ 
+ 			/* disable dedup and use the ordinary write pipeline */
+ 			zio_pop_transforms(zio);
+ 			zp->zp_dedup = zp->zp_dedup_verify = B_FALSE;
+ 			zio->io_stage = ZIO_STAGE_OPEN;
+ 			zio->io_pipeline = ZIO_WRITE_PIPELINE;
+ 			zio->io_bp_override = NULL;
+ 			BP_ZERO(bp);
+ 			dde_exit(dde);
+ 
+ 			return (ZIO_PIPELINE_CONTINUE);
+ 		}
+ 	}
+ 	ASSERT(!(dde->dde_state & DDE_DONT_SYNC));
+ 
  	if (zp->zp_dedup_verify && zio_ddt_collision(zio, ddt, dde)) {
  		/*
  		 * If we're using a weak checksum, upgrade to a strong checksum
  		 * and try again.  If we're already using a strong checksum,
  		 * we can't resolve it, so just convert to an ordinary write.
*** 2703,2716 ****
  			zp->zp_dedup = B_FALSE;
  			BP_SET_DEDUP(bp, B_FALSE);
  		}
  		ASSERT(!BP_GET_DEDUP(bp));
  		zio->io_pipeline = ZIO_WRITE_PIPELINE;
! 		ddt_exit(ddt);
  		return (ZIO_PIPELINE_CONTINUE);
  	}
  
  	ditto_copies = ddt_ditto_copies_needed(ddt, dde, ddp);
  	ASSERT(ditto_copies < SPA_DVAS_PER_BP);
  
  	if (ditto_copies > ddt_ditto_copies_present(dde) &&
  	    dde->dde_lead_zio[DDT_PHYS_DITTO] == NULL) {
--- 2856,2870 ----
  			zp->zp_dedup = B_FALSE;
  			BP_SET_DEDUP(bp, B_FALSE);
  		}
  		ASSERT(!BP_GET_DEDUP(bp));
  		zio->io_pipeline = ZIO_WRITE_PIPELINE;
! 		dde_exit(dde);
  		return (ZIO_PIPELINE_CONTINUE);
  	}
  
+ 	ddp = &dde->dde_phys[p];
  	ditto_copies = ddt_ditto_copies_needed(ddt, dde, ddp);
  	ASSERT(ditto_copies < SPA_DVAS_PER_BP);
  
  	if (ditto_copies > ddt_ditto_copies_present(dde) &&
  	    dde->dde_lead_zio[DDT_PHYS_DITTO] == NULL) {
*** 2729,2746 ****
  		zio_pop_transforms(zio);
  		zio->io_stage = ZIO_STAGE_OPEN;
  		zio->io_pipeline = ZIO_WRITE_PIPELINE;
  		zio->io_bp_override = NULL;
  		BP_ZERO(bp);
! 		ddt_exit(ddt);
  		return (ZIO_PIPELINE_CONTINUE);
  	}
  
  	dio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
  	    zio->io_orig_size, zio->io_orig_size, &czp, NULL, NULL,
  	    NULL, zio_ddt_ditto_write_done, dde, zio->io_priority,
! 	    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
  
  	zio_push_transform(dio, zio->io_abd, zio->io_size, 0, NULL);
  	dde->dde_lead_zio[DDT_PHYS_DITTO] = dio;
  }
--- 2883,2900 ----
  		zio_pop_transforms(zio);
  		zio->io_stage = ZIO_STAGE_OPEN;
  		zio->io_pipeline = ZIO_WRITE_PIPELINE;
  		zio->io_bp_override = NULL;
  		BP_ZERO(bp);
! 		dde_exit(dde);
  		return (ZIO_PIPELINE_CONTINUE);
  	}
  
  	dio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
  	    zio->io_orig_size, zio->io_orig_size, &czp, NULL, NULL,
  	    NULL, zio_ddt_ditto_write_done, dde, zio->io_priority,
! 	    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
  
  	zio_push_transform(dio, zio->io_abd, zio->io_size, 0, NULL);
  	dde->dde_lead_zio[DDT_PHYS_DITTO] = dio;
  }
*** 2759,2775 ****
  	} else {
  		cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
  		    zio->io_orig_size, zio->io_orig_size, zp,
  		    zio_ddt_child_write_ready, NULL, NULL,
  		    zio_ddt_child_write_done, dde, zio->io_priority,
! 		    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
  
  		zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);
  		dde->dde_lead_zio[p] = cio;
  	}
  
! 	ddt_exit(ddt);
  
  	if (cio)
  		zio_nowait(cio);
  	if (dio)
  		zio_nowait(dio);
--- 2913,2929 ----
  	} else {
  		cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
  		    zio->io_orig_size, zio->io_orig_size, zp,
  		    zio_ddt_child_write_ready, NULL, NULL,
  		    zio_ddt_child_write_done, dde, zio->io_priority,
! 		    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
  
  		zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);
  		dde->dde_lead_zio[p] = cio;
  	}
  
! 	dde_exit(dde);
  
  	if (cio)
  		zio_nowait(cio);
  	if (dio)
  		zio_nowait(dio);
*** 2789,2803 ****
  	ddt_phys_t *ddp;
  
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
  
- 	ddt_enter(ddt);
  	freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
  	ddp = ddt_phys_select(dde, bp);
  	ddt_phys_decref(ddp);
! 	ddt_exit(ddt);
  
  	return (ZIO_PIPELINE_CONTINUE);
  }
  
  /*
--- 2943,2957 ----
  	ddt_phys_t *ddp;
  
  	ASSERT(BP_GET_DEDUP(bp));
  	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
  
  	freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
  	ddp = ddt_phys_select(dde, bp);
+ 	if (ddp)
  		ddt_phys_decref(ddp);
! 	dde_exit(dde);
  
  	return (ZIO_PIPELINE_CONTINUE);
  }
  
  /*
*** 2805,2836 ****
   * Allocate and free blocks
   * ==========================================================================
   */
  
  static zio_t *
! zio_io_to_allocate(spa_t *spa)
  {
  	zio_t *zio;
  
! 	ASSERT(MUTEX_HELD(&spa->spa_alloc_lock));
  
! 	zio = avl_first(&spa->spa_alloc_tree);
  	if (zio == NULL)
  		return (NULL);
  
  	ASSERT(IO_IS_ALLOCATING(zio));
  
  	/*
  	 * Try to place a reservation for this zio. If we're unable to
  	 * reserve then we throttle.
  	 */
! 	if (!metaslab_class_throttle_reserve(spa_normal_class(spa),
  	    zio->io_prop.zp_copies, zio, 0)) {
  		return (NULL);
  	}
  
! 	avl_remove(&spa->spa_alloc_tree, zio);
  	ASSERT3U(zio->io_stage, <, ZIO_STAGE_DVA_ALLOCATE);
  
  	return (zio);
  }
--- 2959,2990 ----
   * Allocate and free blocks
   * ==========================================================================
   */
  
  static zio_t *
! zio_io_to_allocate(metaslab_class_t *mc)
  {
  	zio_t *zio;
  
! 	ASSERT(MUTEX_HELD(&mc->mc_alloc_lock));
  
! 	zio = avl_first(&mc->mc_alloc_tree);
  	if (zio == NULL)
  		return (NULL);
  
  	ASSERT(IO_IS_ALLOCATING(zio));
  
  	/*
  	 * Try to place a reservation for this zio. If we're unable to
  	 * reserve then we throttle.
  	 */
! 	if (!metaslab_class_throttle_reserve(mc,
  	    zio->io_prop.zp_copies, zio, 0)) {
  		return (NULL);
  	}
  
! 	avl_remove(&mc->mc_alloc_tree, zio);
  	ASSERT3U(zio->io_stage, <, ZIO_STAGE_DVA_ALLOCATE);
  
  	return (zio);
  }
*** 2838,2849 ****
  zio_dva_throttle(zio_t *zio)
  {
  	spa_t *spa = zio->io_spa;
  	zio_t *nio;
  
  	if (zio->io_priority == ZIO_PRIORITY_SYNC_WRITE ||
! 	    !spa_normal_class(zio->io_spa)->mc_alloc_throttle_enabled ||
  	    zio->io_child_type == ZIO_CHILD_GANG ||
  	    zio->io_flags & ZIO_FLAG_NODATA) {
  		return (ZIO_PIPELINE_CONTINUE);
  	}
--- 2992,3010 ----
  zio_dva_throttle(zio_t *zio)
  {
  	spa_t *spa = zio->io_spa;
  	zio_t *nio;
  
+ 	/* We need to use parent's MetaslabClass */
+ 	if (zio->io_mc == NULL) {
+ 		zio->io_mc = spa_select_class(spa, zio);
+ 		if (zio->io_prop.zp_usewbc)
+ 			return (ZIO_PIPELINE_CONTINUE);
+ 	}
+ 
  	if (zio->io_priority == ZIO_PRIORITY_SYNC_WRITE ||
! 	    !zio->io_mc->mc_alloc_throttle_enabled ||
  	    zio->io_child_type == ZIO_CHILD_GANG ||
  	    zio->io_flags & ZIO_FLAG_NODATA) {
  		return (ZIO_PIPELINE_CONTINUE);
  	}
*** 2850,2866 ****
  	ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
  
  	ASSERT3U(zio->io_queued_timestamp, >, 0);
  	ASSERT(zio->io_stage == ZIO_STAGE_DVA_THROTTLE);
  
! 	mutex_enter(&spa->spa_alloc_lock);
  
  	ASSERT(zio->io_type == ZIO_TYPE_WRITE);
! 	avl_add(&spa->spa_alloc_tree, zio);
! 
! 	nio = zio_io_to_allocate(zio->io_spa);
! 	mutex_exit(&spa->spa_alloc_lock);
  
  	if (nio == zio)
  		return (ZIO_PIPELINE_CONTINUE);
  
  	if (nio != NULL) {
--- 3011,3027 ----
  	ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
  
  	ASSERT3U(zio->io_queued_timestamp, >, 0);
  	ASSERT(zio->io_stage == ZIO_STAGE_DVA_THROTTLE);
  
! 	mutex_enter(&zio->io_mc->mc_alloc_lock);
  
  	ASSERT(zio->io_type == ZIO_TYPE_WRITE);
! 	avl_add(&zio->io_mc->mc_alloc_tree, zio);
! 
! 	nio = zio_io_to_allocate(zio->io_mc);
! 	mutex_exit(&zio->io_mc->mc_alloc_lock);
  
  	if (nio == zio)
  		return (ZIO_PIPELINE_CONTINUE);
  
  	if (nio != NULL) {
*** 2877,2893 ****
  	}
  
  	return (ZIO_PIPELINE_STOP);
  }
  
  void
! zio_allocate_dispatch(spa_t *spa)
  {
  	zio_t *zio;
  
! 	mutex_enter(&spa->spa_alloc_lock);
! 	zio = zio_io_to_allocate(spa);
! 	mutex_exit(&spa->spa_alloc_lock);
  	if (zio == NULL)
  		return;
  
  	ASSERT3U(zio->io_stage, ==, ZIO_STAGE_DVA_THROTTLE);
  	ASSERT0(zio->io_error);
--- 3038,3054 ----
  	}
  
  	return (ZIO_PIPELINE_STOP);
  }
  
  void
! zio_allocate_dispatch(metaslab_class_t *mc)
  {
  	zio_t *zio;
  
! 	mutex_enter(&mc->mc_alloc_lock);
! 	zio = zio_io_to_allocate(mc);
! 	mutex_exit(&mc->mc_alloc_lock);
  	if (zio == NULL)
  		return;
  
  	ASSERT3U(zio->io_stage, ==, ZIO_STAGE_DVA_THROTTLE);
  	ASSERT0(zio->io_error);
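Since reservations are now taken against a specific metaslab class, the release side must kick the queue of that same class. A sketch of the matching release path, assuming the completing zio still carries its class in io_mc (upstream illumos does the equivalent in zio_done() against the normal class; this is not a line from this patch):

    /* Sketch: on completion, release the reservation and kick the queue. */
    metaslab_class_throttle_unreserve(zio->io_mc,
        zio->io_prop.zp_copies, zio);
    zio_allocate_dispatch(zio->io_mc);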
*** 2896,2906 ****
  static int
  zio_dva_allocate(zio_t *zio)
  {
  	spa_t *spa = zio->io_spa;
! 	metaslab_class_t *mc = spa_normal_class(spa);
  	blkptr_t *bp = zio->io_bp;
  	int error;
  	int flags = 0;
  
  	if (zio->io_gang_leader == NULL) {
--- 3057,3068 ----
  static int
  zio_dva_allocate(zio_t *zio)
  {
  	spa_t *spa = zio->io_spa;
! 	metaslab_class_t *mc = zio->io_mc;
! 	blkptr_t *bp = zio->io_bp;
  	int error;
  	int flags = 0;
  
  	if (zio->io_gang_leader == NULL) {
*** 2912,2941 ****
  	ASSERT0(BP_GET_NDVAS(bp));
  	ASSERT3U(zio->io_prop.zp_copies, >, 0);
  	ASSERT3U(zio->io_prop.zp_copies, <=, spa_max_replication(spa));
  	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
  
! 	if (zio->io_flags & ZIO_FLAG_NODATA) {
  		flags |= METASLAB_DONT_THROTTLE;
  	}
  	if (zio->io_flags & ZIO_FLAG_GANG_CHILD) {
  		flags |= METASLAB_GANG_CHILD;
  	}
! 	if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE) {
  		flags |= METASLAB_ASYNC_ALLOC;
  	}
  
  	error = metaslab_alloc(spa, mc, zio->io_size, bp,
  	    zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
  	    &zio->io_alloc_list, zio);
  
  	if (error != 0) {
  		spa_dbgmsg(spa, "%s: metaslab allocation failure: zio %p, "
  		    "size %llu, error %d", spa_name(spa), zio, zio->io_size,
  		    error);
! 		if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
  			return (zio_write_gang_block(zio));
  		zio->io_error = error;
  	}
  
  	return (ZIO_PIPELINE_CONTINUE);
  }
--- 3074,3122 ----
  	ASSERT0(BP_GET_NDVAS(bp));
  	ASSERT3U(zio->io_prop.zp_copies, >, 0);
  	ASSERT3U(zio->io_prop.zp_copies, <=, spa_max_replication(spa));
  	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
  
! 	if (zio->io_flags & ZIO_FLAG_NODATA || zio->io_prop.zp_usewbc) {
  		flags |= METASLAB_DONT_THROTTLE;
  	}
  	if (zio->io_flags & ZIO_FLAG_GANG_CHILD) {
  		flags |= METASLAB_GANG_CHILD;
  	}
! 	if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE &&
! 	    zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
  		flags |= METASLAB_ASYNC_ALLOC;
  	}
  
  	error = metaslab_alloc(spa, mc, zio->io_size, bp,
  	    zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
  	    &zio->io_alloc_list, zio);
  
+ #ifdef _KERNEL
+ 	DTRACE_PROBE6(zio_dva_allocate,
+ 	    uint64_t, DVA_GET_VDEV(&bp->blk_dva[0]),
+ 	    uint64_t, DVA_GET_VDEV(&bp->blk_dva[1]),
+ 	    uint64_t, BP_GET_LEVEL(bp),
+ 	    boolean_t, BP_IS_SPECIAL(bp),
+ 	    boolean_t, BP_IS_METADATA(bp),
+ 	    int, error);
+ #endif
+ 
  	if (error != 0) {
  		spa_dbgmsg(spa, "%s: metaslab allocation failure: zio %p, "
  		    "size %llu, error %d", spa_name(spa), zio, zio->io_size,
  		    error);
! 		if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE) {
! 			if (zio->io_prop.zp_usewbc) {
! 				zio->io_prop.zp_usewbc = B_FALSE;
! 				zio->io_prop.zp_usesc = B_FALSE;
! 				zio->io_mc = spa_normal_class(spa);
! 			}
  
  			return (zio_write_gang_block(zio));
+ 		}
+ 
  		zio->io_error = error;
  	}
  
  	return (ZIO_PIPELINE_CONTINUE);
  }
*** 2989,3013 ****
  zio_alloc_zil(spa_t *spa, uint64_t txg, blkptr_t *new_bp, blkptr_t *old_bp,
      uint64_t size, boolean_t *slog)
  {
  	int error = 1;
  	zio_alloc_list_t io_alloc_list;
  
  	ASSERT(txg > spa_syncing_txg(spa));
  
  	metaslab_trace_init(&io_alloc_list);
! 	error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
! 	    txg, old_bp, METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
! 	if (error == 0) {
  		*slog = TRUE;
! 	} else {
  		error = metaslab_alloc(spa, spa_normal_class(spa), size,
  		    new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID,
  		    &io_alloc_list, NULL);
  		if (error == 0)
  			*slog = FALSE;
  	}
  	metaslab_trace_fini(&io_alloc_list);
  
  	if (error == 0) {
  		BP_SET_LSIZE(new_bp, size);
  		BP_SET_PSIZE(new_bp, size);
--- 3170,3237 ----
  zio_alloc_zil(spa_t *spa, uint64_t txg, blkptr_t *new_bp, blkptr_t *old_bp,
      uint64_t size, boolean_t *slog)
  {
  	int error = 1;
  	zio_alloc_list_t io_alloc_list;
+ 	spa_meta_placement_t *mp = &spa->spa_meta_policy;
  
  	ASSERT(txg > spa_syncing_txg(spa));
  
  	metaslab_trace_init(&io_alloc_list);
! 
! 	/*
! 	 * ZIL blocks are always contiguous (i.e. not gang blocks)
! 	 * so we set the METASLAB_HINTBP_AVOID flag so that they
! 	 * don't "fast gang" when allocating them.
! 	 * If the caller indicates that slog is not to be used
! 	 * (via use_slog)
! 	 * separate allocation class will not indeed be used,
! 	 * independently of whether this is log or special
! 	 */
! 
! 	if (spa_has_slogs(spa)) {
! 		error = metaslab_alloc(spa, spa_log_class(spa),
! 		    size, new_bp, 1, txg, old_bp,
! 		    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
! 
! 		DTRACE_PROBE2(zio_alloc_zil_log,
! 		    spa_t *, spa, int, error);
! 
! 		if (error == 0)
  			*slog = TRUE;
! 	}
! 
! 	/*
! 	 * use special when failed to allocate from the regular
! 	 * slog, but only if allowed and if the special used
! 	 * space is below watermarks
! 	 */
! 	if (error != 0 && spa_can_special_be_used(spa) &&
! 	    mp->spa_sync_to_special != SYNC_TO_SPECIAL_DISABLED) {
! 		error = metaslab_alloc(spa, spa_special_class(spa),
! 		    size, new_bp, 1, txg, old_bp,
! 		    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
! 
! 		DTRACE_PROBE2(zio_alloc_zil_special,
! 		    spa_t *, spa, int, error);
! 
! 		if (error == 0)
! 			*slog = FALSE;
! 	}
! 
! 	if (error != 0) {
  		error = metaslab_alloc(spa, spa_normal_class(spa), size,
  		    new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID,
  		    &io_alloc_list, NULL);
+ 
+ 		DTRACE_PROBE2(zio_alloc_zil_normal,
+ 		    spa_t *, spa, int, error);
+ 
  		if (error == 0)
  			*slog = FALSE;
  	}
+ 
  	metaslab_trace_fini(&io_alloc_list);
  
  	if (error == 0) {
  		BP_SET_LSIZE(new_bp, size);
  		BP_SET_PSIZE(new_bp, size);
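The net effect is a three-step fallback for ZIL block allocation: slog first, then the special class when policy and watermarks allow it, then the normal class. A condensed pseudocode view (alloc_from() and special_ok() are shorthand for the metaslab_alloc() calls and policy checks above, not real functions):

    error = ENOSPC;
    if (spa_has_slogs(spa))
            error = alloc_from(spa_log_class(spa));     /* *slog = TRUE */
    if (error != 0 && special_ok(spa))
            error = alloc_from(spa_special_class(spa)); /* *slog = FALSE */
    if (error != 0)
            error = alloc_from(spa_normal_class(spa));  /* *slog = FALSE */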
*** 3060,3069 ****
--- 3284,3295 ----
  zio_vdev_io_start(zio_t *zio)
  {
  	vdev_t *vd = zio->io_vd;
  	uint64_t align;
  	spa_t *spa = zio->io_spa;
+ 	zio_type_t type = zio->io_type;
  
+ 	zio->io_vd_timestamp = gethrtime();
  	ASSERT(zio->io_error == 0);
  	ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
  
  	if (vd == NULL) {
*** 3076,3124 ****
  		vdev_mirror_ops.vdev_op_io_start(zio);
  		return (ZIO_PIPELINE_STOP);
  	}
  
  	ASSERT3P(zio->io_logical, !=, zio);
- 	if (zio->io_type == ZIO_TYPE_WRITE) {
- 		ASSERT(spa->spa_trust_config);
- 		if (zio->io_vd->vdev_removing) {
- 			ASSERT(zio->io_flags &
- 			    (ZIO_FLAG_PHYSICAL | ZIO_FLAG_SELF_HEAL |
- 			    ZIO_FLAG_INDUCE_DAMAGE));
- 		}
- 	}
- 
- 	/*
- 	 * We keep track of time-sensitive I/Os so that the scan thread
- 	 * can quickly react to certain workloads.  In particular, we care
- 	 * about non-scrubbing, top-level reads and writes with the following
- 	 * characteristics:
- 	 *	- synchronous writes of user data to non-slog devices
- 	 *	- any reads of user data
- 	 * When these conditions are met, adjust the timestamp of spa_last_io
- 	 * which allows the scan thread to adjust its workload accordingly.
- 	 */
- 	if (!(zio->io_flags & ZIO_FLAG_SCAN_THREAD) && zio->io_bp != NULL &&
- 	    vd == vd->vdev_top && !vd->vdev_islog &&
- 	    zio->io_bookmark.zb_objset != DMU_META_OBJSET &&
- 	    zio->io_txg != spa_syncing_txg(spa)) {
- 		uint64_t old = spa->spa_last_io;
- 		uint64_t new = ddi_get_lbolt64();
- 		if (old != new)
- 			(void) atomic_cas_64(&spa->spa_last_io, old, new);
- 	}
- 
  	align = 1ULL << vd->vdev_top->vdev_ashift;
  
  	if (!(zio->io_flags & ZIO_FLAG_PHYSICAL) &&
  	    P2PHASE(zio->io_size, align) != 0) {
  		/* Transform logical writes to be a full physical block size. */
  		uint64_t asize = P2ROUNDUP(zio->io_size, align);
  		abd_t *abuf = abd_alloc_sametype(zio->io_abd, asize);
  		ASSERT(vd == vd->vdev_top);
! 		if (zio->io_type == ZIO_TYPE_WRITE) {
  			abd_copy(abuf, zio->io_abd, zio->io_size);
  			abd_zero_off(abuf, zio->io_size, asize - zio->io_size);
  		}
  		zio_push_transform(zio, abuf, asize, asize, zio_subblock);
  	}
--- 3302,3321 ----
  		vdev_mirror_ops.vdev_op_io_start(zio);
  		return (ZIO_PIPELINE_STOP);
  	}
  
  	ASSERT3P(zio->io_logical, !=, zio);
  
  	align = 1ULL << vd->vdev_top->vdev_ashift;
  
  	if (!(zio->io_flags & ZIO_FLAG_PHYSICAL) &&
  	    P2PHASE(zio->io_size, align) != 0) {
  		/* Transform logical writes to be a full physical block size. */
  		uint64_t asize = P2ROUNDUP(zio->io_size, align);
  		abd_t *abuf = abd_alloc_sametype(zio->io_abd, asize);
  		ASSERT(vd == vd->vdev_top);
! 		if (type == ZIO_TYPE_WRITE) {
  			abd_copy(abuf, zio->io_abd, zio->io_size);
  			abd_zero_off(abuf, zio->io_size, asize - zio->io_size);
  		}
  		zio_push_transform(zio, abuf, asize, asize, zio_subblock);
  	}
*** 3137,3147 ****
       */
      ASSERT0(P2PHASE(zio->io_offset, SPA_MINBLOCKSIZE));
      ASSERT0(P2PHASE(zio->io_size, SPA_MINBLOCKSIZE));
  }

! VERIFY(zio->io_type != ZIO_TYPE_WRITE || spa_writeable(spa));

  /*
   * If this is a repair I/O, and there's no self-healing involved --
   * that is, we're just resilvering what we expect to resilver --
   * then don't do the I/O unless zio's txg is actually in vd's DTL.
--- 3334,3344 ----
       */
      ASSERT0(P2PHASE(zio->io_offset, SPA_MINBLOCKSIZE));
      ASSERT0(P2PHASE(zio->io_size, SPA_MINBLOCKSIZE));
  }

! VERIFY(type != ZIO_TYPE_WRITE || spa_writeable(spa));

  /*
   * If this is a repair I/O, and there's no self-healing involved --
   * that is, we're just resilvering what we expect to resilver --
   * then don't do the I/O unless zio's txg is actually in vd's DTL.
*** 3156,3174 ****
   */
  if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
      !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
      zio->io_txg != 0 &&    /* not a delegated i/o */
      !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
!     ASSERT(zio->io_type == ZIO_TYPE_WRITE);
      zio_vdev_io_bypass(zio);
      return (ZIO_PIPELINE_CONTINUE);
  }

  if (vd->vdev_ops->vdev_op_leaf &&
!     (zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
! 
!     if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio))
          return (ZIO_PIPELINE_CONTINUE);

      if ((zio = vdev_queue_io(zio)) == NULL)
          return (ZIO_PIPELINE_STOP);

--- 3353,3370 ----
   */
  if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
      !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
      zio->io_txg != 0 &&    /* not a delegated i/o */
      !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
!     ASSERT(type == ZIO_TYPE_WRITE);
      zio_vdev_io_bypass(zio);
      return (ZIO_PIPELINE_CONTINUE);
  }

  if (vd->vdev_ops->vdev_op_leaf &&
!     (type == ZIO_TYPE_READ || type == ZIO_TYPE_WRITE)) {
!     if (type == ZIO_TYPE_READ && vdev_cache_read(zio))
          return (ZIO_PIPELINE_CONTINUE);

      if ((zio = vdev_queue_io(zio)) == NULL)
          return (ZIO_PIPELINE_STOP);

*** 3175,3185 ****
--- 3371,3390 ----
      if (!vdev_accessible(vd, zio)) {
          zio->io_error = SET_ERROR(ENXIO);
          zio_interrupt(zio);
          return (ZIO_PIPELINE_STOP);
      }
+ 
+     /*
+      * Insert a fault simulation delay for a particular vdev.
+      */
+     if (zio_faulty_vdev_enabled &&
+         (zio->io_vd->vdev_guid == zio_faulty_vdev_guid)) {
+         delay(NSEC_TO_TICK(zio_faulty_vdev_delay_us *
+             (NANOSEC / MICROSEC)));
      }
+ }

  vd->vdev_ops->vdev_op_io_start(zio);
  return (ZIO_PIPELINE_STOP);
  }
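The fault-injection hook depends on three tunables whose definitions sit outside this hunk; the declarations below are an assumption inferred from the call site, not part of the diff. Note the unit chain: the delay is kept in microseconds, scaled by (NANOSEC / MICROSEC) into nanoseconds, and then converted by NSEC_TO_TICK() into the clock ticks that delay() expects.

    /* Assumed declarations for the fault-simulation tunables (sketch only). */
    boolean_t zio_faulty_vdev_enabled = B_FALSE;    /* master on/off switch */
    uint64_t zio_faulty_vdev_guid = 0;              /* guid of the vdev to slow */
    uint64_t zio_faulty_vdev_delay_us = 0;          /* injected latency, usec */

Being plain globals, they can presumably be flipped on a live system with something like echo 'zio_faulty_vdev_enabled/W 1' | mdb -kw, making it easy to slow down a single disk and observe the resulting queueing behavior.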
*** 3188,3205 ****
  zio_vdev_io_done(zio_t *zio)
  {
      vdev_t *vd = zio->io_vd;
      vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
      boolean_t unexpected_error = B_FALSE;

!     if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
          return (ZIO_PIPELINE_STOP);
-     }

      ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);

      if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
- 
          vdev_queue_io_done(zio);

          if (zio->io_type == ZIO_TYPE_WRITE)
              vdev_cache_write(zio);
--- 3393,3408 ----
  zio_vdev_io_done(zio_t *zio)
  {
      vdev_t *vd = zio->io_vd;
      vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
      boolean_t unexpected_error = B_FALSE;

!     if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
          return (ZIO_PIPELINE_STOP);

      ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);

      if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
          vdev_queue_io_done(zio);

          if (zio->io_type == ZIO_TYPE_WRITE)
              vdev_cache_write(zio);
*** 3222,3231 ****
--- 3425,3440 ----
      ops->vdev_op_io_done(zio);

      if (unexpected_error)
          VERIFY(vdev_probe(vd, zio) == NULL);

+     /*
+      * Measure delta between start and end of the I/O in nanoseconds.
+      * XXX: Handle overflow.
+      */
+     zio->io_vd_timestamp = gethrtime() - zio->io_vd_timestamp;
+ 
      return (ZIO_PIPELINE_CONTINUE);
  }

  /*
   * For non-raidz ZIOs, we can just copy aside the bad data read from the
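Paired with the gethrtime() call added to zio_vdev_io_start() above, io_vd_timestamp does double duty: while the I/O is in flight it holds the start time, and after zio_vdev_io_done() it holds the vdev service time in nanoseconds. (The XXX is conservative: gethrtime() is monotonic, so the delta cannot go negative in practice.) The same pattern in a self-contained userland form, with clock_gettime(CLOCK_MONOTONIC) standing in for gethrtime():

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Monotonic nanoseconds, analogous to gethrtime(). */
    static uint64_t
    nsec_now(void)
    {
        struct timespec ts;

        (void) clock_gettime(CLOCK_MONOTONIC, &ts);
        return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
    }

    int
    main(void)
    {
        uint64_t t = nsec_now();        /* io_start: stash the start time */

        (void) usleep(1000);            /* stand-in for the device I/O */
        t = nsec_now() - t;             /* io_done: field becomes a duration */
        (void) printf("I/O service time: %llu ns\n", (unsigned long long)t);
        return (0);
    }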
*** 3256,3268 ****
  static int
  zio_vdev_io_assess(zio_t *zio)
  {
      vdev_t *vd = zio->io_vd;

!     if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
          return (ZIO_PIPELINE_STOP);
-     }

      if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
          spa_config_exit(zio->io_spa, SCL_ZIO, zio);

      if (zio->io_vsd != NULL) {
--- 3465,3476 ----
  static int
  zio_vdev_io_assess(zio_t *zio)
  {
      vdev_t *vd = zio->io_vd;

!     if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
          return (ZIO_PIPELINE_STOP);

      if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
          spa_config_exit(zio->io_spa, SCL_ZIO, zio);

      if (zio->io_vsd != NULL) {
*** 3473,3486 ****
  {
      blkptr_t *bp = zio->io_bp;
      zio_t *pio, *pio_next;
      zio_link_t *zl = NULL;

!     if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT | ZIO_CHILD_DDT_BIT,
!         ZIO_WAIT_READY)) {
          return (ZIO_PIPELINE_STOP);
-     }

      if (zio->io_ready) {
          ASSERT(IO_IS_ALLOCATING(zio));
          ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
              (zio->io_flags & ZIO_FLAG_NOPWRITE));
--- 3681,3693 ----
  {
      blkptr_t *bp = zio->io_bp;
      zio_t *pio, *pio_next;
      zio_link_t *zl = NULL;

!     if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
!         zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_READY))
          return (ZIO_PIPELINE_STOP);

      if (zio->io_ready) {
          ASSERT(IO_IS_ALLOCATING(zio));
          ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
              (zio->io_flags & ZIO_FLAG_NOPWRITE));
*** 3500,3513 ****
          ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);

          /*
           * We were unable to allocate anything, unreserve and
           * issue the next I/O to allocate.
           */
!         metaslab_class_throttle_unreserve(
!             spa_normal_class(zio->io_spa),
              zio->io_prop.zp_copies, zio);
!         zio_allocate_dispatch(zio->io_spa);
      }

      mutex_enter(&zio->io_lock);
      zio->io_state[ZIO_WAIT_READY] = 1;
--- 3707,3719 ----
          ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);

          /*
           * We were unable to allocate anything, unreserve and
           * issue the next I/O to allocate.
           */
!         metaslab_class_throttle_unreserve(zio->io_mc,
              zio->io_prop.zp_copies, zio);
!         zio_allocate_dispatch(zio->io_mc);
      }

      mutex_enter(&zio->io_lock);
      zio->io_state[ZIO_WAIT_READY] = 1;
*** 3589,3607 ****

  mutex_enter(&pio->io_lock);
  metaslab_group_alloc_decrement(zio->io_spa, vd->vdev_id, pio, flags);
  mutex_exit(&pio->io_lock);

! metaslab_class_throttle_unreserve(spa_normal_class(zio->io_spa),
!     1, pio);

  /*
   * Call into the pipeline to see if there is more work that
   * needs to be done. If there is work to be done it will be
   * dispatched to another taskq thread.
   */
! zio_allocate_dispatch(zio->io_spa);
  }

  static int
  zio_done(zio_t *zio)
  {
--- 3795,3812 ----

  mutex_enter(&pio->io_lock);
  metaslab_group_alloc_decrement(zio->io_spa, vd->vdev_id, pio, flags);
  mutex_exit(&pio->io_lock);

! metaslab_class_throttle_unreserve(pio->io_mc, 1, pio);

  /*
   * Call into the pipeline to see if there is more work that
   * needs to be done. If there is work to be done it will be
   * dispatched to another taskq thread.
   */
! zio_allocate_dispatch(pio->io_mc);
  }

  static int
  zio_done(zio_t *zio)
  {
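Both unreserve sites now use the metaslab class recorded in io_mc instead of hard-coding spa_normal_class(). With the special class participating in the DVA throttle (NEX-13140), the invariant is that a reservation must be released against the same class it was taken from, and the follow-up dispatch must wake a waiter on that class's queue rather than the normal class's. A minimal sketch of that invariant, using hypothetical types and helpers rather than the real metaslab API:

    /* Sketch only: the class is captured at reserve time and reused later. */
    typedef struct mclass { uint64_t mc_reserved; } mclass_t;
    typedef struct mio { mclass_t *io_mc; uint64_t io_copies; } mio_t;

    static void
    io_throttle_reserve(mio_t *io, mclass_t *mc)
    {
        io->io_mc = mc;                         /* remember the class */
        mc->mc_reserved += io->io_copies;
    }

    static void
    io_throttle_unreserve(mio_t *io)
    {
        io->io_mc->mc_reserved -= io->io_copies;    /* same class, always */
    }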
*** 3609,3628 ****
      zio_t *lio = zio->io_logical;
      blkptr_t *bp = zio->io_bp;
      vdev_t *vd = zio->io_vd;
      uint64_t psize = zio->io_size;
      zio_t *pio, *pio_next;
!     metaslab_class_t *mc = spa_normal_class(spa);
      zio_link_t *zl = NULL;

      /*
       * If our children haven't all completed,
       * wait for them and then repeat this pipeline stage.
       */
!     if (zio_wait_for_children(zio, ZIO_CHILD_ALL_BITS, ZIO_WAIT_DONE)) {
          return (ZIO_PIPELINE_STOP);
-     }

      /*
       * If the allocation throttle is enabled, then update the accounting.
       * We only track child I/Os that are part of an allocating async
       * write. We must do this since the allocation is performed
--- 3814,3835 ----
      zio_t *lio = zio->io_logical;
      blkptr_t *bp = zio->io_bp;
      vdev_t *vd = zio->io_vd;
      uint64_t psize = zio->io_size;
      zio_t *pio, *pio_next;
!     metaslab_class_t *mc = zio->io_mc;
      zio_link_t *zl = NULL;

      /*
       * If our children haven't all completed,
       * wait for them and then repeat this pipeline stage.
       */
!     if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
!         zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
!         zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) ||
!         zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
          return (ZIO_PIPELINE_STOP);

      /*
       * If the allocation throttle is enabled, then update the accounting.
       * We only track child I/Os that are part of an allocating async
       * write. We must do this since the allocation is performed
*** 3908,3917 ****
--- 4115,4152 ----
      }

      return (ZIO_PIPELINE_STOP);
  }

+ zio_t *
+ zio_wbc(zio_type_t type, vdev_t *vd, abd_t *data,
+     uint64_t size, uint64_t offset)
+ {
+     zio_t *zio = NULL;
+ 
+     switch (type) {
+     case ZIO_TYPE_WRITE:
+         zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
+             size, NULL, NULL, ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
+             ZIO_FLAG_PHYSICAL, vd, offset,
+             NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
+         break;
+     case ZIO_TYPE_READ:
+         zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
+             size, NULL, NULL, ZIO_TYPE_READ, ZIO_PRIORITY_ASYNC_READ,
+             ZIO_FLAG_DONT_CACHE | ZIO_FLAG_PHYSICAL, vd, offset,
+             NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
+         break;
+     default:
+         ASSERT(0);
+     }
+ 
+     zio->io_prop.zp_checksum = ZIO_CHECKSUM_OFF;
+ 
+     return (zio);
+ }
+ 
  /*
   * ==========================================================================
   * I/O pipeline definition
   * ==========================================================================
   */
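zio_wbc() builds bare physical zios for the write-back cache (WBC) data mover: no block pointer, checksums forced off, and the physical read/write pipelines instead of the logical ones, presumably because the mover copies raw blocks whose checksums are already recorded in their block pointers. A hedged usage sketch follows; vd_src, vd_dst, size, and offset are assumed to come from a hypothetical migration loop, and error handling is elided.

    /* Sketch: copy one block between vdevs using zio_wbc(). */
    abd_t *abd = abd_alloc_linear(size, B_FALSE);

    /* Physical read; ZIO_FLAG_DONT_CACHE keeps it out of the vdev cache. */
    VERIFY0(zio_wait(zio_wbc(ZIO_TYPE_READ, vd_src, abd, size, offset)));

    /* Physical write of the same bytes to the destination vdev. */
    VERIFY0(zio_wait(zio_wbc(ZIO_TYPE_WRITE, vd_dst, abd, size, offset)));

    abd_free(abd);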