NEX-15270 pool clear does not "repair" cache devices
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dmitry Savitsky <dmitry.savitsky@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9940 Appliance requires a reboot after JBOD power failure or disconnecting all SAS cables
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5736 implement autoreplace matching based on FRU slot number
NEX-6200 hot spares are not reactivated after reinserting into enclosure
NEX-9403 need to update FRU for spare and l2cache devices
NEX-9404 remove lofi autoreplace support from syseventd
NEX-9409 hotsparing doesn't work for vdevs without FRU
NEX-9424 zfs`vdev_online() needs better notification about state changes
Portions contributed by: Alek Pinchuk <alek@nexenta.com>
Portions contributed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8206 dtrace helpers leak when cfork() fails
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8507 erroneous check in vdev_type_is_ddt()
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4584 System panic when adding special vdev to a pool that does not support feature flags
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-2846 Enable Automatic/Intelligent Hot Sparing capability
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3729 KRRP changes mess up iostat(1M)
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
NEX-4204 Removing vdev while on-demand trim is ongoing locks up pool
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3212 remove vdev prop object type from dmu.h, p2
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3025 support root pools on EFI labeled disks
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-1142 move rwlock to vdev to protect vdev_tsd, not just the ldi handle.
This way we serialize open/close, yet allow parallel I/O.
NEX-801 If a block pointer is corrupt read or write may crash
If the block pointer is corrupt in such a way that the vdev id of one of the
ditto blocks is wrong (out of range), zio_vdev_io_start or zio_vdev_io_done
may trip over it and crash.
This changeset takes care of that by reporting an invalid vdev as
neither readable nor writable.
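The guard amounts to a bounds-checked lookup plus NULL-tolerant readability checks; a hypothetical simplification (the real `vdev_lookup_top()`, `vdev_readable()`, and `vdev_writeable()` in vdev.c consult more state than this):

```c
#include <stddef.h>

/* Hypothetical, pared-down stand-ins for the real ZFS structures. */
typedef struct vdev {
        int vdev_cant_read;
        int vdev_cant_write;
} vdev_t;

typedef struct spa {
        unsigned long spa_vdev_count;
        vdev_t **spa_vdevs;
} spa_t;

/*
 * Look up a vdev by the id stored in a (possibly corrupt) block pointer.
 * An out-of-range id yields NULL instead of indexing past the array.
 */
static vdev_t *
vdev_lookup(spa_t *spa, unsigned long vdev_id)
{
        if (vdev_id >= spa->spa_vdev_count)
                return (NULL);
        return (spa->spa_vdevs[vdev_id]);
}

/* An invalid (NULL) vdev is reported as neither readable nor writable. */
static int
vdev_is_readable(vdev_t *vd)
{
        return (vd != NULL && !vd->vdev_cant_read);
}

static int
vdev_is_writeable(vdev_t *vd)
{
        return (vd != NULL && !vd->vdev_cant_write);
}
```

With this shape, the zio pipeline can treat a bogus ditto-block vdev id like any other unusable device and fail the I/O instead of dereferencing garbage.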
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

@@ -19,25 +19,23 @@
  * CDDL HEADER END
  */
 
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
- * Copyright 2017 Nexenta Systems, Inc.
+ * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
+ * Copyright 2018 Nexenta Systems, Inc.
  * Copyright (c) 2014 Integros [integros.com]
  * Copyright 2016 Toomas Soome <tsoome@me.com>
  * Copyright 2017 Joyent, Inc.
  */
 
 #include <sys/zfs_context.h>
 #include <sys/fm/fs/zfs.h>
 #include <sys/spa.h>
 #include <sys/spa_impl.h>
-#include <sys/bpobj.h>
 #include <sys/dmu.h>
 #include <sys/dmu_tx.h>
-#include <sys/dsl_dir.h>
 #include <sys/vdev_impl.h>
 #include <sys/uberblock_impl.h>
 #include <sys/metaslab.h>
 #include <sys/metaslab_impl.h>
 #include <sys/space_map.h>

@@ -62,98 +60,27 @@
         &vdev_spare_ops,
         &vdev_disk_ops,
         &vdev_file_ops,
         &vdev_missing_ops,
         &vdev_hole_ops,
-        &vdev_indirect_ops,
         NULL
 };
 
 /* maximum scrub/resilver I/O queue per leaf vdev */
 int zfs_scrub_limit = 10;
 
 /*
+ * alpha for exponential moving average of I/O latency (in 1/10th of a percent)
+ */
+int zfs_vs_latency_alpha = 100;
+
+/*
  * When a vdev is added, it will be divided into approximately (but no
  * more than) this number of metaslabs.
  */
 int metaslabs_per_vdev = 200;
 
-boolean_t vdev_validate_skip = B_FALSE;
-
-/*PRINTFLIKE2*/
-void
-vdev_dbgmsg(vdev_t *vd, const char *fmt, ...)
-{
-        va_list adx;
-        char buf[256];
-
-        va_start(adx, fmt);
-        (void) vsnprintf(buf, sizeof (buf), fmt, adx);
-        va_end(adx);
-
-        if (vd->vdev_path != NULL) {
-                zfs_dbgmsg("%s vdev '%s': %s", vd->vdev_ops->vdev_op_type,
-                    vd->vdev_path, buf);
-        } else {
-                zfs_dbgmsg("%s-%llu vdev (guid %llu): %s",
-                    vd->vdev_ops->vdev_op_type,
-                    (u_longlong_t)vd->vdev_id,
-                    (u_longlong_t)vd->vdev_guid, buf);
-        }
-}
-
-void
-vdev_dbgmsg_print_tree(vdev_t *vd, int indent)
-{
-        char state[20];
-
-        if (vd->vdev_ishole || vd->vdev_ops == &vdev_missing_ops) {
-                zfs_dbgmsg("%*svdev %u: %s", indent, "", vd->vdev_id,
-                    vd->vdev_ops->vdev_op_type);
-                return;
-        }
-
-        switch (vd->vdev_state) {
-        case VDEV_STATE_UNKNOWN:
-                (void) snprintf(state, sizeof (state), "unknown");
-                break;
-        case VDEV_STATE_CLOSED:
-                (void) snprintf(state, sizeof (state), "closed");
-                break;
-        case VDEV_STATE_OFFLINE:
-                (void) snprintf(state, sizeof (state), "offline");
-                break;
-        case VDEV_STATE_REMOVED:
-                (void) snprintf(state, sizeof (state), "removed");
-                break;
-        case VDEV_STATE_CANT_OPEN:
-                (void) snprintf(state, sizeof (state), "can't open");
-                break;
-        case VDEV_STATE_FAULTED:
-                (void) snprintf(state, sizeof (state), "faulted");
-                break;
-        case VDEV_STATE_DEGRADED:
-                (void) snprintf(state, sizeof (state), "degraded");
-                break;
-        case VDEV_STATE_HEALTHY:
-                (void) snprintf(state, sizeof (state), "healthy");
-                break;
-        default:
-                (void) snprintf(state, sizeof (state), "<state %u>",
-                    (uint_t)vd->vdev_state);
-        }
-
-        zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent,
-            "", vd->vdev_id, vd->vdev_ops->vdev_op_type,
-            vd->vdev_islog ? " (log)" : "",
-            (u_longlong_t)vd->vdev_guid,
-            vd->vdev_path ? vd->vdev_path : "N/A", state);
-
-        for (uint64_t i = 0; i < vd->vdev_children; i++)
-                vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2);
-}
-
 /*
  * Given a vdev type, return the appropriate ops vector.
  */
 static vdev_ops_t *
 vdev_getops(const char *type)

@@ -165,10 +92,16 @@
                         break;
 
         return (ops);
 }
 
+boolean_t
+vdev_is_special(vdev_t *vd)
+{
+        return (vd ? vd->vdev_isspecial : B_FALSE);
+}
+
 /*
  * Default asize function: return the MAX of psize with the asize of
  * all children.  This is what's used by anything other than RAID-Z.
  */
 uint64_t

@@ -310,10 +243,13 @@
         }
 
         pvd->vdev_child = newchild;
         pvd->vdev_child[id] = cvd;
 
+        cvd->vdev_isspecial_child =
+            (pvd->vdev_isspecial || pvd->vdev_isspecial_child);
+
         cvd->vdev_top = (pvd->vdev_top ? pvd->vdev_top: cvd);
         ASSERT(cvd->vdev_top->vdev_parent->vdev_parent == NULL);
 
         /*
          * Walk up all ancestors to update guid sum.

@@ -391,14 +327,12 @@
  */
 vdev_t *
 vdev_alloc_common(spa_t *spa, uint_t id, uint64_t guid, vdev_ops_t *ops)
 {
         vdev_t *vd;
-        vdev_indirect_config_t *vic;
 
         vd = kmem_zalloc(sizeof (vdev_t), KM_SLEEP);
-        vic = &vd->vdev_indirect_config;
 
         if (spa->spa_root_vdev == NULL) {
                 ASSERT(ops == &vdev_root_ops);
                 spa->spa_root_vdev = vd;
                 spa->spa_load_guid = spa_generate_guid(NULL);

@@ -425,22 +359,19 @@
         vd->vdev_guid = guid;
         vd->vdev_guid_sum = guid;
         vd->vdev_ops = ops;
         vd->vdev_state = VDEV_STATE_CLOSED;
         vd->vdev_ishole = (ops == &vdev_hole_ops);
-        vic->vic_prev_indirect_vdev = UINT64_MAX;
 
-        rw_init(&vd->vdev_indirect_rwlock, NULL, RW_DEFAULT, NULL);
-        mutex_init(&vd->vdev_obsolete_lock, NULL, MUTEX_DEFAULT, NULL);
-        vd->vdev_obsolete_segments = range_tree_create(NULL, NULL);
-
         mutex_init(&vd->vdev_dtl_lock, NULL, MUTEX_DEFAULT, NULL);
         mutex_init(&vd->vdev_stat_lock, NULL, MUTEX_DEFAULT, NULL);
         mutex_init(&vd->vdev_probe_lock, NULL, MUTEX_DEFAULT, NULL);
-        mutex_init(&vd->vdev_queue_lock, NULL, MUTEX_DEFAULT, NULL);
+        mutex_init(&vd->vdev_scan_io_queue_lock, NULL, MUTEX_DEFAULT, NULL);
+        rw_init(&vd->vdev_tsd_lock, NULL, RW_DEFAULT, NULL);
         for (int t = 0; t < DTL_TYPES; t++) {
-                vd->vdev_dtl[t] = range_tree_create(NULL, NULL);
+                vd->vdev_dtl[t] = range_tree_create(NULL, NULL,
+                    &vd->vdev_dtl_lock);
         }
         txg_list_create(&vd->vdev_ms_list, spa,
             offsetof(struct metaslab, ms_txg_node));
         txg_list_create(&vd->vdev_dtl_list, spa,
             offsetof(struct vdev, vdev_dtl_node));

@@ -460,13 +391,13 @@
 vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id,
     int alloctype)
 {
         vdev_ops_t *ops;
         char *type;
-        uint64_t guid = 0, islog, nparity;
+        uint64_t guid = 0, nparity;
+        uint64_t isspecial = 0, islog = 0;
         vdev_t *vd;
-        vdev_indirect_config_t *vic;
 
         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
 
         if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0)
                 return (SET_ERROR(EINVAL));

@@ -505,15 +436,21 @@
                 return (SET_ERROR(EINVAL));
 
         /*
          * Determine whether we're a log vdev.
          */
-        islog = 0;
         (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_LOG, &islog);
         if (islog && spa_version(spa) < SPA_VERSION_SLOGS)
                 return (SET_ERROR(ENOTSUP));
 
+        /*
+         * Determine whether we're a special vdev.
+         */
+        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_SPECIAL, &isspecial);
+        if (isspecial && spa_version(spa) < SPA_VERSION_FEATURES)
+                return (SET_ERROR(ENOTSUP));
+
         if (ops == &vdev_hole_ops && spa_version(spa) < SPA_VERSION_HOLES)
                 return (SET_ERROR(ENOTSUP));
 
         /*
          * Set the nparity property for RAID-Z vdevs.

@@ -550,14 +487,16 @@
                 nparity = 0;
         }
         ASSERT(nparity != -1ULL);
 
         vd = vdev_alloc_common(spa, id, guid, ops);
-        vic = &vd->vdev_indirect_config;
 
         vd->vdev_islog = islog;
+        vd->vdev_isspecial = isspecial;
         vd->vdev_nparity = nparity;
+        vd->vdev_isspecial_child = (parent != NULL &&
+            (parent->vdev_isspecial || parent->vdev_isspecial_child));
 
         if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &vd->vdev_path) == 0)
                 vd->vdev_path = spa_strdup(vd->vdev_path);
         if (nvlist_lookup_string(nv, ZPOOL_CONFIG_DEVID, &vd->vdev_devid) == 0)
                 vd->vdev_devid = spa_strdup(vd->vdev_devid);

@@ -565,27 +504,55 @@
             &vd->vdev_physpath) == 0)
                 vd->vdev_physpath = spa_strdup(vd->vdev_physpath);
         if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &vd->vdev_fru) == 0)
                 vd->vdev_fru = spa_strdup(vd->vdev_fru);
 
+#ifdef _KERNEL
+        if (vd->vdev_path) {
+                char dev_path[MAXPATHLEN];
+                char *last_slash = NULL;
+                kstat_t *exist = NULL;
+
+                if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) == 0)
+                        last_slash = strrchr(vd->vdev_path, '/');
+
+                (void) sprintf(dev_path, "%s:%s", spa->spa_name,
+                    last_slash != NULL ? last_slash + 1 : vd->vdev_path);
+
+                exist = kstat_hold_byname("zfs", 0, dev_path, ALL_ZONES);
+
+                if (!exist) {
+                        vd->vdev_iokstat = kstat_create("zfs", 0, dev_path,
+                            "zfs", KSTAT_TYPE_IO, 1, 0);
+
+                        if (vd->vdev_iokstat) {
+                                vd->vdev_iokstat->ks_lock =
+                                    &spa->spa_iokstat_lock;
+                                kstat_install(vd->vdev_iokstat);
+                        }
+                } else {
+                        kstat_rele(exist);
+                }
+        }
+#endif
+
         /*
          * Set the whole_disk property.  If it's not specified, leave the value
          * as -1.
          */
         if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
             &vd->vdev_wholedisk) != 0)
                 vd->vdev_wholedisk = -1ULL;
 
-        ASSERT0(vic->vic_mapping_object);
-        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT,
-            &vic->vic_mapping_object);
-        ASSERT0(vic->vic_births_object);
-        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS,
-            &vic->vic_births_object);
-        ASSERT3U(vic->vic_prev_indirect_vdev, ==, UINT64_MAX);
-        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV,
-            &vic->vic_prev_indirect_vdev);
+        /*
+         * Set the is_ssd property.  If it's not specified it means the media
+         * is not SSD or the request failed and we assume it's not.
+         */
+        if (nvlist_lookup_boolean(nv, ZPOOL_CONFIG_IS_SSD) == 0)
+                vd->vdev_is_ssd = B_TRUE;
+        else
+                vd->vdev_is_ssd = B_FALSE;
 
         /*
          * Look for the 'not present' flag.  This will only be set if the device
          * was not present at the time of import.
          */

@@ -621,16 +588,19 @@
         } else {
                 ASSERT0(vd->vdev_top_zap);
         }
 
         if (parent && !parent->vdev_parent && alloctype != VDEV_ALLOC_ATTACH) {
+                metaslab_class_t *mc = isspecial ? spa_special_class(spa) :
+                    (islog ? spa_log_class(spa) : spa_normal_class(spa));
+
                 ASSERT(alloctype == VDEV_ALLOC_LOAD ||
                     alloctype == VDEV_ALLOC_ADD ||
                     alloctype == VDEV_ALLOC_SPLIT ||
                     alloctype == VDEV_ALLOC_ROOTPOOL);
-                vd->vdev_mg = metaslab_group_create(islog ?
-                    spa_log_class(spa) : spa_normal_class(spa), vd);
+
+                vd->vdev_mg = metaslab_group_create(mc, vd);
         }
 
         if (vd->vdev_ops->vdev_op_leaf &&
             (alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_SPLIT)) {
                 (void) nvlist_lookup_uint64(nv,

@@ -708,10 +678,20 @@
 vdev_free(vdev_t *vd)
 {
         spa_t *spa = vd->vdev_spa;
 
         /*
+         * Scan queues are normally destroyed at the end of a scan. If the
+         * queue exists here, that implies the vdev is being removed while
+         * the scan is still running.
+         */
+        if (vd->vdev_scan_io_queue != NULL) {
+                dsl_scan_io_queue_destroy(vd->vdev_scan_io_queue);
+                vd->vdev_scan_io_queue = NULL;
+        }
+
+        /*
          * vdev_free() implies closing the vdev first.  This is simpler than
          * trying to ensure complicated semantics for all callers.
          */
         vdev_close(vd);
 

@@ -775,35 +755,25 @@
                 range_tree_vacate(vd->vdev_dtl[t], NULL, NULL);
                 range_tree_destroy(vd->vdev_dtl[t]);
         }
         mutex_exit(&vd->vdev_dtl_lock);
 
-        EQUIV(vd->vdev_indirect_births != NULL,
-            vd->vdev_indirect_mapping != NULL);
-        if (vd->vdev_indirect_births != NULL) {
-                vdev_indirect_mapping_close(vd->vdev_indirect_mapping);
-                vdev_indirect_births_close(vd->vdev_indirect_births);
+        if (vd->vdev_iokstat) {
+                kstat_delete(vd->vdev_iokstat);
+                vd->vdev_iokstat = NULL;
         }
-
-        if (vd->vdev_obsolete_sm != NULL) {
-                ASSERT(vd->vdev_removing ||
-                    vd->vdev_ops == &vdev_indirect_ops);
-                space_map_close(vd->vdev_obsolete_sm);
-                vd->vdev_obsolete_sm = NULL;
-        }
-        range_tree_destroy(vd->vdev_obsolete_segments);
-        rw_destroy(&vd->vdev_indirect_rwlock);
-        mutex_destroy(&vd->vdev_obsolete_lock);
-
-        mutex_destroy(&vd->vdev_queue_lock);
         mutex_destroy(&vd->vdev_dtl_lock);
         mutex_destroy(&vd->vdev_stat_lock);
         mutex_destroy(&vd->vdev_probe_lock);
+        mutex_destroy(&vd->vdev_scan_io_queue_lock);
+        rw_destroy(&vd->vdev_tsd_lock);
 
         if (vd == spa->spa_root_vdev)
                 spa->spa_root_vdev = NULL;
 
+        ASSERT3P(vd->vdev_scan_io_queue, ==, NULL);
+
         kmem_free(vd, sizeof (vdev_t));
 }
 
 /*
  * Transfer top-level vdev state from svd to tvd.

@@ -869,10 +839,16 @@
         tvd->vdev_deflate_ratio = svd->vdev_deflate_ratio;
         svd->vdev_deflate_ratio = 0;
 
         tvd->vdev_islog = svd->vdev_islog;
         svd->vdev_islog = 0;
+
+        tvd->vdev_isspecial = svd->vdev_isspecial;
+        svd->vdev_isspecial = 0;
+        svd->vdev_isspecial_child = tvd->vdev_isspecial;
+
+        dsl_scan_io_queue_vdev_xfer(svd, tvd);
 }
 
 static void
 vdev_top_update(vdev_t *tvd, vdev_t *vd)
 {

@@ -900,11 +876,10 @@
         mvd = vdev_alloc_common(spa, cvd->vdev_id, 0, ops);
 
         mvd->vdev_asize = cvd->vdev_asize;
         mvd->vdev_min_asize = cvd->vdev_min_asize;
         mvd->vdev_max_asize = cvd->vdev_max_asize;
-        mvd->vdev_psize = cvd->vdev_psize;
         mvd->vdev_ashift = cvd->vdev_ashift;
         mvd->vdev_state = cvd->vdev_state;
         mvd->vdev_crtxg = cvd->vdev_crtxg;
 
         vdev_remove_child(pvd, cvd);

@@ -981,10 +956,19 @@
         if (vd->vdev_ms_shift == 0)
                 return (0);
 
         ASSERT(!vd->vdev_ishole);
 
+        /*
+         * Compute the raidz-deflation ratio.  Note, we hard-code
+         * in 128k (1 << 17) because it is the "typical" blocksize.
+         * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
+         * otherwise it would inconsistently account for existing bp's.
+         */
+        vd->vdev_deflate_ratio = (1 << 17) /
+            (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
+
         ASSERT(oldc <= newc);
 
         mspp = kmem_zalloc(newc * sizeof (*mspp), KM_SLEEP);
 
         if (oldc != 0) {

@@ -996,34 +980,23 @@
         vd->vdev_ms_count = newc;
 
         for (m = oldc; m < newc; m++) {
                 uint64_t object = 0;
 
-                /*
-                 * vdev_ms_array may be 0 if we are creating the "fake"
-                 * metaslabs for an indirect vdev for zdb's leak detection.
-                 * See zdb_leak_init().
-                 */
-                if (txg == 0 && vd->vdev_ms_array != 0) {
+                if (txg == 0) {
                         error = dmu_read(mos, vd->vdev_ms_array,
                             m * sizeof (uint64_t), sizeof (uint64_t), &object,
                             DMU_READ_PREFETCH);
-                        if (error != 0) {
-                                vdev_dbgmsg(vd, "unable to read the metaslab "
-                                    "array [error=%d]", error);
+                        if (error)
                                 return (error);
                         }
-                }
 
                 error = metaslab_init(vd->vdev_mg, m, object, txg,
                     &(vd->vdev_ms[m]));
-                if (error != 0) {
-                        vdev_dbgmsg(vd, "metaslab_init failed [error=%d]",
-                            error);
+                if (error)
                         return (error);
                 }
-        }
 
         if (txg == 0)
                 spa_config_enter(spa, SCL_ALLOC, FTAG, RW_WRITER);
 
         /*

@@ -1041,26 +1014,24 @@
 }
 
 void
 vdev_metaslab_fini(vdev_t *vd)
 {
-        if (vd->vdev_ms != NULL) {
+        uint64_t m;
                 uint64_t count = vd->vdev_ms_count;
 
+        if (vd->vdev_ms != NULL) {
                 metaslab_group_passivate(vd->vdev_mg);
-                for (uint64_t m = 0; m < count; m++) {
+                for (m = 0; m < count; m++) {
                         metaslab_t *msp = vd->vdev_ms[m];
 
                         if (msp != NULL)
                                 metaslab_fini(msp);
                 }
                 kmem_free(vd->vdev_ms, count * sizeof (metaslab_t *));
                 vd->vdev_ms = NULL;
-
-                vd->vdev_ms_count = 0;
         }
-        ASSERT0(vd->vdev_ms_count);
 }
 
 typedef struct vdev_probe_stats {
         boolean_t       vps_readable;
         boolean_t       vps_writeable;

@@ -1100,11 +1071,10 @@
                 if (vdev_readable(vd) &&
                     (vdev_writeable(vd) || !spa_writeable(spa))) {
                         zio->io_error = 0;
                 } else {
                         ASSERT(zio->io_error != 0);
-                        vdev_dbgmsg(vd, "failed probe");
                         zfs_ereport_post(FM_EREPORT_ZFS_PROBE_FAILURE,
                             spa, vd, NULL, 0, 0);
                         zio->io_error = SET_ERROR(ENXIO);
                 }
 

@@ -1268,25 +1238,10 @@
 
         taskq_destroy(tq);
 }
 
 /*
- * Compute the raidz-deflation ratio.  Note, we hard-code
- * in 128k (1 << 17) because it is the "typical" blocksize.
- * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
- * otherwise it would inconsistently account for existing bp's.
- */
-static void
-vdev_set_deflate_ratio(vdev_t *vd)
-{
-        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
-                vd->vdev_deflate_ratio = (1 << 17) /
-                    (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
-        }
-}
-
-/*
  * Prepare a virtual device for access.
  */
 int
 vdev_open(vdev_t *vd)
 {

@@ -1307,14 +1262,15 @@
         vd->vdev_cant_read = B_FALSE;
         vd->vdev_cant_write = B_FALSE;
         vd->vdev_min_asize = vdev_get_min_asize(vd);
 
         /*
-         * If this vdev is not removed, check its fault status.  If it's
-         * faulted, bail out of the open.
+         * If vdev isn't removed and is faulted for reasons other than failed
+         * open, or if it's offline - bail out.
          */
-        if (!vd->vdev_removed && vd->vdev_faulted) {
+        if (!vd->vdev_removed && vd->vdev_faulted &&
+            vd->vdev_label_aux != VDEV_AUX_OPEN_FAILED) {
                 ASSERT(vd->vdev_children == 0);
                 ASSERT(vd->vdev_label_aux == VDEV_AUX_ERR_EXCEEDED ||
                     vd->vdev_label_aux == VDEV_AUX_EXTERNAL);
                 vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED,
                     vd->vdev_label_aux);

@@ -1338,17 +1294,12 @@
         if (error) {
                 if (vd->vdev_removed &&
                     vd->vdev_stat.vs_aux != VDEV_AUX_OPEN_FAILED)
                         vd->vdev_removed = B_FALSE;
 
-                if (vd->vdev_stat.vs_aux == VDEV_AUX_CHILDREN_OFFLINE) {
-                        vdev_set_state(vd, B_TRUE, VDEV_STATE_OFFLINE,
-                            vd->vdev_stat.vs_aux);
-                } else {
                         vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
                             vd->vdev_stat.vs_aux);
-                }
                 return (error);
         }
 
         vd->vdev_removed = B_FALSE;
 

@@ -1504,53 +1455,46 @@
 /*
  * Called once the vdevs are all opened, this routine validates the label
  * contents. This needs to be done before vdev_load() so that we don't
  * inadvertently do repair I/Os to the wrong device.
  *
+ * If 'strict' is false ignore the spa guid check. This is necessary because
+ * if the machine crashed during a re-guid the new guid might have been written
+ * to all of the vdev labels, but not the cached config. The strict check
+ * will be performed when the pool is opened again using the mos config.
+ *
  * This function will only return failure if one of the vdevs indicates that it
  * has since been destroyed or exported.  This is only possible if
  * /etc/zfs/zpool.cache was readonly at the time.  Otherwise, the vdev state
  * will be updated but the function will return 0.
  */
 int
-vdev_validate(vdev_t *vd)
+vdev_validate(vdev_t *vd, boolean_t strict)
 {
         spa_t *spa = vd->vdev_spa;
         nvlist_t *label;
-        uint64_t guid = 0, aux_guid = 0, top_guid;
+        uint64_t guid = 0, top_guid;
         uint64_t state;
-        nvlist_t *nvl;
-        uint64_t txg;
 
-        if (vdev_validate_skip)
-                return (0);
-
-        for (uint64_t c = 0; c < vd->vdev_children; c++)
-                if (vdev_validate(vd->vdev_child[c]) != 0)
+        for (int c = 0; c < vd->vdev_children; c++)
+                if (vdev_validate(vd->vdev_child[c], strict) != 0)
                         return (SET_ERROR(EBADF));
 
         /*
          * If the device has already failed, or was marked offline, don't do
          * any further validation.  Otherwise, label I/O will fail and we will
          * overwrite the previous state.
          */
-        if (!vd->vdev_ops->vdev_op_leaf || !vdev_readable(vd))
-                return (0);
+        if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
+                uint64_t aux_guid = 0;
+                nvlist_t *nvl;
+                uint64_t txg = spa_last_synced_txg(spa) != 0 ?
+                    spa_last_synced_txg(spa) : -1ULL;
 
-        /*
-         * If we are performing an extreme rewind, we allow for a label that
-         * was modified at a point after the current txg.
-         */
-        if (spa->spa_extreme_rewind || spa_last_synced_txg(spa) == 0)
-                txg = UINT64_MAX;
-        else
-                txg = spa_last_synced_txg(spa);
-
         if ((label = vdev_label_read_config(vd, txg)) == NULL) {
                 vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
                     VDEV_AUX_BAD_LABEL);
-                vdev_dbgmsg(vd, "vdev_validate: failed reading config");
                 return (0);
         }
 
         /*
          * Determine if this vdev has been split off into another

@@ -1559,113 +1503,56 @@
         if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID,
             &aux_guid) == 0 && aux_guid == spa_guid(spa)) {
                 vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                     VDEV_AUX_SPLIT_POOL);
                 nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: vdev split into other pool");
                 return (0);
         }
 
-        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID, &guid) != 0) {
+                if (strict && (nvlist_lookup_uint64(label,
+                    ZPOOL_CONFIG_POOL_GUID, &guid) != 0 ||
+                    guid != spa_guid(spa))) {
                 vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                     VDEV_AUX_CORRUPT_DATA);
                 nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
-                    ZPOOL_CONFIG_POOL_GUID);
                 return (0);
         }
 
-        /*
-         * If config is not trusted then ignore the spa guid check. This is
-         * necessary because if the machine crashed during a re-guid the new
-         * guid might have been written to all of the vdev labels, but not the
-         * cached config. The check will be performed again once we have the
-         * trusted config from the MOS.
-         */
-        if (spa->spa_trust_config && guid != spa_guid(spa)) {
-                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
-                    VDEV_AUX_CORRUPT_DATA);
-                nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: vdev label pool_guid doesn't "
-                    "match config (%llu != %llu)", (u_longlong_t)guid,
-                    (u_longlong_t)spa_guid(spa));
-                return (0);
-        }
-
         if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl)
             != 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID,
             &aux_guid) != 0)
                 aux_guid = 0;
 
-        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID, &guid) != 0) {
-                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
-                    VDEV_AUX_CORRUPT_DATA);
-                nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
-                    ZPOOL_CONFIG_GUID);
-                return (0);
-        }
-
-        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID, &top_guid)
-            != 0) {
-                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
-                    VDEV_AUX_CORRUPT_DATA);
-                nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
-                    ZPOOL_CONFIG_TOP_GUID);
-                return (0);
-        }
-
         /*
-         * If this vdev just became a top-level vdev because its sibling was
-         * detached, it will have adopted the parent's vdev guid -- but the
-         * label may or may not be on disk yet. Fortunately, either version
-         * of the label will have the same top guid, so if we're a top-level
-         * vdev, we can safely compare to that instead.
-         * However, if the config comes from a cachefile that failed to update
-         * after the detach, a top-level vdev will appear as a non top-level
-         * vdev in the config. Also relax the constraints if we perform an
-         * extreme rewind.
+                 * If this vdev just became a top-level vdev because its
+                 * sibling was detached, it will have adopted the parent's
+                 * vdev guid -- but the label may or may not be on disk yet.
+                 * Fortunately, either version of the label will have the
+                 * same top guid, so if we're a top-level vdev, we can
+                 * safely compare to that instead.
          *
          * If we split this vdev off instead, then we also check the
          * original pool's guid. We don't want to consider the vdev
          * corrupt if it is partway through a split operation.
          */
-        if (vd->vdev_guid != guid && vd->vdev_guid != aux_guid) {
-                boolean_t mismatch = B_FALSE;
-                if (spa->spa_trust_config && !spa->spa_extreme_rewind) {
-                        if (vd != vd->vdev_top || vd->vdev_guid != top_guid)
-                                mismatch = B_TRUE;
-                } else {
-                        if (vd->vdev_guid != top_guid &&
-                            vd->vdev_top->vdev_guid != guid)
-                                mismatch = B_TRUE;
-                }
-
-                if (mismatch) {
+                if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
+                    &guid) != 0 ||
+                    nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID,
+                    &top_guid) != 0 ||
+                    ((vd->vdev_guid != guid && vd->vdev_guid != aux_guid) &&
+                    (vd->vdev_guid != top_guid || vd != vd->vdev_top))) {
                         vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                             VDEV_AUX_CORRUPT_DATA);
                         nvlist_free(label);
-                        vdev_dbgmsg(vd, "vdev_validate: config guid "
-                            "doesn't match label guid");
-                        vdev_dbgmsg(vd, "CONFIG: guid %llu, top_guid %llu",
-                            (u_longlong_t)vd->vdev_guid,
-                            (u_longlong_t)vd->vdev_top->vdev_guid);
-                        vdev_dbgmsg(vd, "LABEL: guid %llu, top_guid %llu, "
-                            "aux_guid %llu", (u_longlong_t)guid,
-                            (u_longlong_t)top_guid, (u_longlong_t)aux_guid);
                         return (0);
                 }
-        }
 
         if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
             &state) != 0) {
                 vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                     VDEV_AUX_CORRUPT_DATA);
                 nvlist_free(label);
-                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
-                    ZPOOL_CONFIG_POOL_STATE);
                 return (0);
         }
 
         nvlist_free(label);
 

@@ -1673,139 +1560,26 @@
          * If this is a verbatim import, no need to check the
          * state of the pool.
          */
         if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) &&
             spa_load_state(spa) == SPA_LOAD_OPEN &&
-            state != POOL_STATE_ACTIVE) {
-                vdev_dbgmsg(vd, "vdev_validate: invalid pool state (%llu) "
-                    "for spa %s", (u_longlong_t)state, spa->spa_name);
+                    state != POOL_STATE_ACTIVE)
                 return (SET_ERROR(EBADF));
-        }
 
         /*
          * If we were able to open and validate a vdev that was
          * previously marked permanently unavailable, clear that state
          * now.
          */
         if (vd->vdev_not_present)
                 vd->vdev_not_present = 0;
-
-        return (0);
-}
-
-static void
-vdev_copy_path_impl(vdev_t *svd, vdev_t *dvd)
-{
-        if (svd->vdev_path != NULL && dvd->vdev_path != NULL) {
-                if (strcmp(svd->vdev_path, dvd->vdev_path) != 0) {
-                        zfs_dbgmsg("vdev_copy_path: vdev %llu: path changed "
-                            "from '%s' to '%s'", (u_longlong_t)dvd->vdev_guid,
-                            dvd->vdev_path, svd->vdev_path);
-                        spa_strfree(dvd->vdev_path);
-                        dvd->vdev_path = spa_strdup(svd->vdev_path);
                 }
-        } else if (svd->vdev_path != NULL) {
-                dvd->vdev_path = spa_strdup(svd->vdev_path);
-                zfs_dbgmsg("vdev_copy_path: vdev %llu: path set to '%s'",
-                    (u_longlong_t)dvd->vdev_guid, dvd->vdev_path);
-        }
-}
 
-/*
- * Recursively copy vdev paths from one vdev to another. Source and destination
- * vdev trees must have same geometry otherwise return error. Intended to copy
- * paths from userland config into MOS config.
- */
-int
-vdev_copy_path_strict(vdev_t *svd, vdev_t *dvd)
-{
-        if ((svd->vdev_ops == &vdev_missing_ops) ||
-            (svd->vdev_ishole && dvd->vdev_ishole) ||
-            (dvd->vdev_ops == &vdev_indirect_ops))
                 return (0);
-
-        if (svd->vdev_ops != dvd->vdev_ops) {
-                vdev_dbgmsg(svd, "vdev_copy_path: vdev type mismatch: %s != %s",
-                    svd->vdev_ops->vdev_op_type, dvd->vdev_ops->vdev_op_type);
-                return (SET_ERROR(EINVAL));
-        }
-
-        if (svd->vdev_guid != dvd->vdev_guid) {
-                vdev_dbgmsg(svd, "vdev_copy_path: guids mismatch (%llu != "
-                    "%llu)", (u_longlong_t)svd->vdev_guid,
-                    (u_longlong_t)dvd->vdev_guid);
-                return (SET_ERROR(EINVAL));
-        }
-
-        if (svd->vdev_children != dvd->vdev_children) {
-                vdev_dbgmsg(svd, "vdev_copy_path: children count mismatch: "
-                    "%llu != %llu", (u_longlong_t)svd->vdev_children,
-                    (u_longlong_t)dvd->vdev_children);
-                return (SET_ERROR(EINVAL));
-        }
-
-        for (uint64_t i = 0; i < svd->vdev_children; i++) {
-                int error = vdev_copy_path_strict(svd->vdev_child[i],
-                    dvd->vdev_child[i]);
-                if (error != 0)
-                        return (error);
-        }
-
-        if (svd->vdev_ops->vdev_op_leaf)
-                vdev_copy_path_impl(svd, dvd);
-
-        return (0);
 }
 
-static void
-vdev_copy_path_search(vdev_t *stvd, vdev_t *dvd)
-{
-        ASSERT(stvd->vdev_top == stvd);
-        ASSERT3U(stvd->vdev_id, ==, dvd->vdev_top->vdev_id);
-
-        for (uint64_t i = 0; i < dvd->vdev_children; i++) {
-                vdev_copy_path_search(stvd, dvd->vdev_child[i]);
-        }
-
-        if (!dvd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(dvd))
-                return;
-
-        /*
-         * The idea here is that while a vdev can shift positions within
-         * a top vdev (when replacing, attaching mirror, etc.) it cannot
-         * step outside of it.
-         */
-        vdev_t *vd = vdev_lookup_by_guid(stvd, dvd->vdev_guid);
-
-        if (vd == NULL || vd->vdev_ops != dvd->vdev_ops)
-                return;
-
-        ASSERT(vd->vdev_ops->vdev_op_leaf);
-
-        vdev_copy_path_impl(vd, dvd);
-}
-
 /*
- * Recursively copy vdev paths from one root vdev to another. Source and
- * destination vdev trees may differ in geometry. For each destination leaf
- * vdev, search a vdev with the same guid and top vdev id in the source.
- * Intended to copy paths from userland config into MOS config.
- */
-void
-vdev_copy_path_relaxed(vdev_t *srvd, vdev_t *drvd)
-{
-        uint64_t children = MIN(srvd->vdev_children, drvd->vdev_children);
-        ASSERT(srvd->vdev_ops == &vdev_root_ops);
-        ASSERT(drvd->vdev_ops == &vdev_root_ops);
-
-        for (uint64_t i = 0; i < children; i++) {
-                vdev_copy_path_search(srvd->vdev_child[i],
-                    drvd->vdev_child[i]);
-        }
-}
-
-/*
  * Close a virtual device.
  */
 void
 vdev_close(vdev_t *vd)
 {

@@ -1893,14 +1667,20 @@
          */
         if (vd->vdev_aux) {
                 (void) vdev_validate_aux(vd);
                 if (vdev_readable(vd) && vdev_writeable(vd) &&
                     vd->vdev_aux == &spa->spa_l2cache &&
-                    !l2arc_vdev_present(vd))
-                        l2arc_add_vdev(spa, vd);
+                    !l2arc_vdev_present(vd)) {
+                        /*
+                         * When reopening we can assume persistent L2ARC is
+                         * supported, since we've already opened the device
+                         * in the past and prepended an L2ARC uberblock.
+                         */
+                        l2arc_add_vdev(spa, vd, B_TRUE);
+                }
         } else {
-                (void) vdev_validate(vd);
+                (void) vdev_validate(vd, B_TRUE);
         }
 
         /*
          * Reassess parent vdev's health.
          */

@@ -1949,12 +1729,11 @@
 
 void
 vdev_dirty(vdev_t *vd, int flags, void *arg, uint64_t txg)
 {
         ASSERT(vd == vd->vdev_top);
-        /* indirect vdevs don't have metaslabs or dtls */
-        ASSERT(vdev_is_concrete(vd) || flags == 0);
+        ASSERT(!vd->vdev_ishole);
         ASSERT(ISP2(flags));
         ASSERT(spa_writeable(vd->vdev_spa));
 
         if (flags & VDD_METASLAB)
                 (void) txg_list_add(&vd->vdev_ms_list, arg, txg);

@@ -2020,14 +1799,14 @@
 
         ASSERT(t < DTL_TYPES);
         ASSERT(vd != vd->vdev_spa->spa_root_vdev);
         ASSERT(spa_writeable(vd->vdev_spa));
 
-        mutex_enter(&vd->vdev_dtl_lock);
+        mutex_enter(rt->rt_lock);
         if (!range_tree_contains(rt, txg, size))
                 range_tree_add(rt, txg, size);
-        mutex_exit(&vd->vdev_dtl_lock);
+        mutex_exit(rt->rt_lock);
 }
 
 boolean_t
 vdev_dtl_contains(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size)
 {

@@ -2035,25 +1814,14 @@
         boolean_t dirty = B_FALSE;
 
         ASSERT(t < DTL_TYPES);
         ASSERT(vd != vd->vdev_spa->spa_root_vdev);
 
-        /*
-         * While we are loading the pool, the DTLs have not been loaded yet.
-         * Ignore the DTLs and try all devices.  This avoids a recursive
-         * mutex enter on the vdev_dtl_lock, and also makes us try hard
-         * when loading the pool (relying on the checksum to ensure that
-         * we get the right data -- note that while loading, we are
-         * only reading the MOS, which is always checksummed).
-         */
-        if (vd->vdev_spa->spa_load_state != SPA_LOAD_NONE)
-                return (B_FALSE);
-
-        mutex_enter(&vd->vdev_dtl_lock);
+        mutex_enter(rt->rt_lock);
         if (range_tree_space(rt) != 0)
                 dirty = range_tree_contains(rt, txg, size);
-        mutex_exit(&vd->vdev_dtl_lock);
+        mutex_exit(rt->rt_lock);
 
         return (dirty);
 }
 
 boolean_t

@@ -2060,13 +1828,13 @@
 vdev_dtl_empty(vdev_t *vd, vdev_dtl_type_t t)
 {
         range_tree_t *rt = vd->vdev_dtl[t];
         boolean_t empty;
 
-        mutex_enter(&vd->vdev_dtl_lock);
+        mutex_enter(rt->rt_lock);
         empty = (range_tree_space(rt) == 0);
-        mutex_exit(&vd->vdev_dtl_lock);
+        mutex_exit(rt->rt_lock);
 
         return (empty);
 }
 
 /*

@@ -2155,11 +1923,11 @@
 
         for (int c = 0; c < vd->vdev_children; c++)
                 vdev_dtl_reassess(vd->vdev_child[c], txg,
                     scrub_txg, scrub_done);
 
-        if (vd == spa->spa_root_vdev || !vdev_is_concrete(vd) || vd->vdev_aux)
+        if (vd == spa->spa_root_vdev || vd->vdev_ishole || vd->vdev_aux)
                 return;
 
         if (vd->vdev_ops->vdev_op_leaf) {
                 dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
 

@@ -2261,14 +2029,14 @@
         spa_t *spa = vd->vdev_spa;
         objset_t *mos = spa->spa_meta_objset;
         int error = 0;
 
         if (vd->vdev_ops->vdev_op_leaf && vd->vdev_dtl_object != 0) {
-                ASSERT(vdev_is_concrete(vd));
+                ASSERT(!vd->vdev_ishole);
 
                 error = space_map_open(&vd->vdev_dtl_sm, mos,
-                    vd->vdev_dtl_object, 0, -1ULL, 0);
+                    vd->vdev_dtl_object, 0, -1ULL, 0, &vd->vdev_dtl_lock);
                 if (error)
                         return (error);
                 ASSERT(vd->vdev_dtl_sm != NULL);
 
                 mutex_enter(&vd->vdev_dtl_lock);

@@ -2343,14 +2111,15 @@
 {
         spa_t *spa = vd->vdev_spa;
         range_tree_t *rt = vd->vdev_dtl[DTL_MISSING];
         objset_t *mos = spa->spa_meta_objset;
         range_tree_t *rtsync;
+        kmutex_t rtlock;
         dmu_tx_t *tx;
         uint64_t object = space_map_object(vd->vdev_dtl_sm);
 
-        ASSERT(vdev_is_concrete(vd));
+        ASSERT(!vd->vdev_ishole);
         ASSERT(vd->vdev_ops->vdev_op_leaf);
 
         tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
 
         if (vd->vdev_detached || vd->vdev_top->vdev_removing) {

@@ -2364,11 +2133,11 @@
                  * We only destroy the leaf ZAP for detached leaves or for
                  * removed log devices. Removed data devices handle leaf ZAP
                  * cleanup later, once cancellation is no longer possible.
                  */
                 if (vd->vdev_leaf_zap != 0 && (vd->vdev_detached ||
-                    vd->vdev_top->vdev_islog)) {
+                    vd->vdev_top->vdev_islog || vd->vdev_top->vdev_isspecial)) {
                         vdev_destroy_unlink_zap(vd, vd->vdev_leaf_zap, tx);
                         vd->vdev_leaf_zap = 0;
                 }
 
                 dmu_tx_commit(tx);

@@ -2380,16 +2149,20 @@
 
                 new_object = space_map_alloc(mos, tx);
                 VERIFY3U(new_object, !=, 0);
 
                 VERIFY0(space_map_open(&vd->vdev_dtl_sm, mos, new_object,
-                    0, -1ULL, 0));
+                    0, -1ULL, 0, &vd->vdev_dtl_lock));
                 ASSERT(vd->vdev_dtl_sm != NULL);
         }
 
-        rtsync = range_tree_create(NULL, NULL);
+        mutex_init(&rtlock, NULL, MUTEX_DEFAULT, NULL);
 
+        rtsync = range_tree_create(NULL, NULL, &rtlock);
+
+        mutex_enter(&rtlock);
+
         mutex_enter(&vd->vdev_dtl_lock);
         range_tree_walk(rt, range_tree_add, rtsync);
         mutex_exit(&vd->vdev_dtl_lock);
 
         space_map_truncate(vd->vdev_dtl_sm, tx);

@@ -2396,19 +2169,21 @@
         space_map_write(vd->vdev_dtl_sm, rtsync, SM_ALLOC, tx);
         range_tree_vacate(rtsync, NULL, NULL);
 
         range_tree_destroy(rtsync);
 
+        mutex_exit(&rtlock);
+        mutex_destroy(&rtlock);
+
         /*
          * If the object for the space map has changed then dirty
          * the top level so that we update the config.
          */
         if (object != space_map_object(vd->vdev_dtl_sm)) {
-                vdev_dbgmsg(vd, "txg %llu, spa %s, DTL old object %llu, "
-                    "new object %llu", (u_longlong_t)txg, spa_name(spa),
-                    (u_longlong_t)object,
-                    (u_longlong_t)space_map_object(vd->vdev_dtl_sm));
+                zfs_dbgmsg("txg %llu, spa %s, DTL old object %llu, "
+                    "new object %llu", txg, spa_name(spa), object,
+                    space_map_object(vd->vdev_dtl_sm));
                 vdev_config_dirty(vd->vdev_top);
         }
 
         dmu_tx_commit(tx);
 

@@ -2489,76 +2264,34 @@
                 *maxp = thismax;
         }
         return (needed);
 }
 
-int
+void
 vdev_load(vdev_t *vd)
 {
-        int error = 0;
         /*
          * Recursively load all children.
          */
-        for (int c = 0; c < vd->vdev_children; c++) {
-                error = vdev_load(vd->vdev_child[c]);
-                if (error != 0) {
-                        return (error);
-                }
-        }
+        for (int c = 0; c < vd->vdev_children; c++)
+                vdev_load(vd->vdev_child[c]);
 
-        vdev_set_deflate_ratio(vd);
-
         /*
          * If this is a top-level vdev, initialize its metaslabs.
          */
-        if (vd == vd->vdev_top && vdev_is_concrete(vd)) {
-                if (vd->vdev_ashift == 0 || vd->vdev_asize == 0) {
+        if (vd == vd->vdev_top && !vd->vdev_ishole &&
+            (vd->vdev_ashift == 0 || vd->vdev_asize == 0 ||
+            vdev_metaslab_init(vd, 0) != 0))
                         vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                             VDEV_AUX_CORRUPT_DATA);
-                        vdev_dbgmsg(vd, "vdev_load: invalid size. ashift=%llu, "
-                            "asize=%llu", (u_longlong_t)vd->vdev_ashift,
-                            (u_longlong_t)vd->vdev_asize);
-                        return (SET_ERROR(ENXIO));
-                } else if ((error = vdev_metaslab_init(vd, 0)) != 0) {
-                        vdev_dbgmsg(vd, "vdev_load: metaslab_init failed "
-                            "[error=%d]", error);
-                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
-                            VDEV_AUX_CORRUPT_DATA);
-                        return (error);
-                }
-        }
 
         /*
          * If this is a leaf vdev, load its DTL.
          */
-        if (vd->vdev_ops->vdev_op_leaf && (error = vdev_dtl_load(vd)) != 0) {
+        if (vd->vdev_ops->vdev_op_leaf && vdev_dtl_load(vd) != 0)
                 vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
                     VDEV_AUX_CORRUPT_DATA);
-                vdev_dbgmsg(vd, "vdev_load: vdev_dtl_load failed "
-                    "[error=%d]", error);
-                return (error);
-        }
-
-        uint64_t obsolete_sm_object = vdev_obsolete_sm_object(vd);
-        if (obsolete_sm_object != 0) {
-                objset_t *mos = vd->vdev_spa->spa_meta_objset;
-                ASSERT(vd->vdev_asize != 0);
-                ASSERT(vd->vdev_obsolete_sm == NULL);
-
-                if ((error = space_map_open(&vd->vdev_obsolete_sm, mos,
-                    obsolete_sm_object, 0, vd->vdev_asize, 0))) {
-                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
-                            VDEV_AUX_CORRUPT_DATA);
-                        vdev_dbgmsg(vd, "vdev_load: space_map_open failed for "
-                            "obsolete spacemap (obj %llu) [error=%d]",
-                            (u_longlong_t)obsolete_sm_object, error);
-                        return (error);
-                }
-                space_map_update(vd->vdev_obsolete_sm);
-        }
-
-        return (0);
 }
 
 /*
  * The special vdev case is used for hot spares and l2cache devices.  Its
  * sole purpose is to set the vdev state for the associated vdev.  To do this,

@@ -2599,46 +2332,18 @@
          */
         nvlist_free(label);
         return (0);
 }
 
-/*
- * Free the objects used to store this vdev's spacemaps, and the array
- * that points to them.
- */
 void
-vdev_destroy_spacemaps(vdev_t *vd, dmu_tx_t *tx)
+vdev_remove(vdev_t *vd, uint64_t txg)
 {
-        if (vd->vdev_ms_array == 0)
-                return;
-
-        objset_t *mos = vd->vdev_spa->spa_meta_objset;
-        uint64_t array_count = vd->vdev_asize >> vd->vdev_ms_shift;
-        size_t array_bytes = array_count * sizeof (uint64_t);
-        uint64_t *smobj_array = kmem_alloc(array_bytes, KM_SLEEP);
-        VERIFY0(dmu_read(mos, vd->vdev_ms_array, 0,
-            array_bytes, smobj_array, 0));
-
-        for (uint64_t i = 0; i < array_count; i++) {
-                uint64_t smobj = smobj_array[i];
-                if (smobj == 0)
-                        continue;
-
-                space_map_free_obj(mos, smobj, tx);
-        }
-
-        kmem_free(smobj_array, array_bytes);
-        VERIFY0(dmu_object_free(mos, vd->vdev_ms_array, tx));
-        vd->vdev_ms_array = 0;
-}
-
-static void
-vdev_remove_empty(vdev_t *vd, uint64_t txg)
-{
         spa_t *spa = vd->vdev_spa;
+        objset_t *mos = spa->spa_meta_objset;
         dmu_tx_t *tx;
 
+        tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
         ASSERT(vd == vd->vdev_top);
         ASSERT3U(txg, ==, spa_syncing_txg(spa));
 
         if (vd->vdev_ms != NULL) {
                 metaslab_group_t *mg = vd->vdev_mg;

@@ -2661,25 +2366,30 @@
                          * and metaslab class are up-to-date.
                          */
                         metaslab_group_histogram_remove(mg, msp);
 
                         VERIFY0(space_map_allocated(msp->ms_sm));
+                        space_map_free(msp->ms_sm, tx);
                         space_map_close(msp->ms_sm);
                         msp->ms_sm = NULL;
                         mutex_exit(&msp->ms_lock);
                 }
 
                 metaslab_group_histogram_verify(mg);
                 metaslab_class_histogram_verify(mg->mg_class);
                 for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
                         ASSERT0(mg->mg_histogram[i]);
+
         }
 
-        tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
-        vdev_destroy_spacemaps(vd, tx);
+        if (vd->vdev_ms_array) {
+                (void) dmu_object_free(mos, vd->vdev_ms_array, tx);
+                vd->vdev_ms_array = 0;
+        }
 
-        if (vd->vdev_islog && vd->vdev_top_zap != 0) {
+        if ((vd->vdev_islog || vd->vdev_isspecial) &&
+            vd->vdev_top_zap != 0) {
                 vdev_destroy_unlink_zap(vd, vd->vdev_top_zap, tx);
                 vd->vdev_top_zap = 0;
         }
         dmu_tx_commit(tx);
 }

@@ -2688,11 +2398,11 @@
 vdev_sync_done(vdev_t *vd, uint64_t txg)
 {
         metaslab_t *msp;
         boolean_t reassess = !txg_list_empty(&vd->vdev_ms_list, TXG_CLEAN(txg));
 
-        ASSERT(vdev_is_concrete(vd));
+        ASSERT(!vd->vdev_ishole);
 
         while (msp = txg_list_remove(&vd->vdev_ms_list, TXG_CLEAN(txg)))
                 metaslab_sync_done(msp, txg);
 
         if (reassess)

@@ -2705,63 +2415,36 @@
         spa_t *spa = vd->vdev_spa;
         vdev_t *lvd;
         metaslab_t *msp;
         dmu_tx_t *tx;
 
-        if (range_tree_space(vd->vdev_obsolete_segments) > 0) {
-                dmu_tx_t *tx;
+        ASSERT(!vd->vdev_ishole);
 
-                ASSERT(vd->vdev_removing ||
-                    vd->vdev_ops == &vdev_indirect_ops);
-
-                tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
-                vdev_indirect_sync_obsolete(vd, tx);
-                dmu_tx_commit(tx);
-
-                /*
-                 * If the vdev is indirect, it can't have dirty
-                 * metaslabs or DTLs.
-                 */
-                if (vd->vdev_ops == &vdev_indirect_ops) {
-                        ASSERT(txg_list_empty(&vd->vdev_ms_list, txg));
-                        ASSERT(txg_list_empty(&vd->vdev_dtl_list, txg));
-                        return;
-                }
-        }
-
-        ASSERT(vdev_is_concrete(vd));
-
-        if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0 &&
-            !vd->vdev_removing) {
+        if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0) {
                 ASSERT(vd == vd->vdev_top);
-                ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
                 tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
                 vd->vdev_ms_array = dmu_object_alloc(spa->spa_meta_objset,
                     DMU_OT_OBJECT_ARRAY, 0, DMU_OT_NONE, 0, tx);
                 ASSERT(vd->vdev_ms_array != 0);
                 vdev_config_dirty(vd);
                 dmu_tx_commit(tx);
         }
 
+        /*
+         * Remove the metadata associated with this vdev once it's empty.
+         */
+        if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing)
+                vdev_remove(vd, txg);
+
         while ((msp = txg_list_remove(&vd->vdev_ms_list, txg)) != NULL) {
                 metaslab_sync(msp, txg);
                 (void) txg_list_add(&vd->vdev_ms_list, msp, TXG_CLEAN(txg));
         }
 
         while ((lvd = txg_list_remove(&vd->vdev_dtl_list, txg)) != NULL)
                 vdev_dtl_sync(lvd, txg);
 
-        /*
-         * Remove the metadata associated with this vdev once it's empty.
-         * Note that this is typically used for log/cache device removal;
-         * we don't empty toplevel vdevs when removing them.  But if
-         * a toplevel happens to be emptied, this is not harmful.
-         */
-        if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing) {
-                vdev_remove_empty(vd, txg);
-        }
-
         (void) txg_list_add(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg));
 }
 
 uint64_t
 vdev_psize_to_asize(vdev_t *vd, uint64_t psize)

@@ -2881,12 +2564,12 @@
 
         wasoffline = (vd->vdev_offline || vd->vdev_tmpoffline);
         oldstate = vd->vdev_state;
 
         tvd = vd->vdev_top;
-        vd->vdev_offline = B_FALSE;
-        vd->vdev_tmpoffline = B_FALSE;
+        vd->vdev_offline = 0ULL;
+        vd->vdev_tmpoffline = 0ULL;
         vd->vdev_checkremove = !!(flags & ZFS_ONLINE_CHECKREMOVE);
         vd->vdev_forcefault = !!(flags & ZFS_ONLINE_FORCEFAULT);
 
         /* XXX - L2ARC 1.0 does not support expansion */
         if (!vd->vdev_aux) {

@@ -2971,11 +2654,11 @@
                          * Prevent any future allocations.
                          */
                         metaslab_group_passivate(mg);
                         (void) spa_vdev_state_exit(spa, vd, 0);
 
-                        error = spa_reset_logs(spa);
+                        error = spa_offline_log(spa);
 
                         spa_vdev_state_enter(spa, SCL_ALLOC);
 
                         /*
                          * Check to see if the config has changed.

@@ -3038,30 +2721,45 @@
  * children.  If 'vd' is NULL, then the user wants to clear all vdevs.
  */
 void
 vdev_clear(spa_t *spa, vdev_t *vd)
 {
+        int c;
         vdev_t *rvd = spa->spa_root_vdev;
 
         ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
 
-        if (vd == NULL)
+        if (vd == NULL) {
                 vd = rvd;
 
+                /* Go through spare and l2cache vdevs */
+                for (c = 0; c < spa->spa_spares.sav_count; c++)
+                        vdev_clear(spa, spa->spa_spares.sav_vdevs[c]);
+                for (c = 0; c < spa->spa_l2cache.sav_count; c++)
+                        vdev_clear(spa, spa->spa_l2cache.sav_vdevs[c]);
+        }
+
         vd->vdev_stat.vs_read_errors = 0;
         vd->vdev_stat.vs_write_errors = 0;
         vd->vdev_stat.vs_checksum_errors = 0;
 
-        for (int c = 0; c < vd->vdev_children; c++)
-                vdev_clear(spa, vd->vdev_child[c]);
-
         /*
-         * It makes no sense to "clear" an indirect vdev.
+         * If all disk vdevs failed at the same time (e.g. due to a
+         * disconnected cable), that suspends I/O activity to the pool,
+         * which stalls spa_sync if there happened to be any dirty data.
+         * As a consequence, this flag might not be cleared, because it
+         * is only lowered by spa_async_remove (which cannot run). This
+         * then prevents zio_resume from succeeding even if vdev reopen
+         * succeeds, leading to an indefinitely suspended pool. So we
+         * lower the flag here to allow zio_resume to succeed, provided
+         * reopening of the vdevs succeeds.
          */
-        if (!vdev_is_concrete(vd))
-                return;
+        vd->vdev_remove_wanted = B_FALSE;
 
+        for (c = 0; c < vd->vdev_children; c++)
+                vdev_clear(spa, vd->vdev_child[c]);
+
         /*
          * If we're in the FAULTED state or have experienced failed I/O, then
          * clear the persistent state and attempt to reopen the device.  We
          * also mark the vdev config dirty, so that the new faulted state is
          * written out to disk.

@@ -3112,26 +2810,24 @@
          * This simplifies the code since we don't have to check for
          * these types of devices in the various code paths.
          * Instead we rely on the fact that we skip over dead devices
          * before issuing I/O to them.
          */
-        return (vd->vdev_state < VDEV_STATE_DEGRADED ||
-            vd->vdev_ops == &vdev_hole_ops ||
+        return (vd->vdev_state < VDEV_STATE_DEGRADED || vd->vdev_ishole ||
             vd->vdev_ops == &vdev_missing_ops);
 }
 
 boolean_t
 vdev_readable(vdev_t *vd)
 {
-        return (!vdev_is_dead(vd) && !vd->vdev_cant_read);
+        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read);
 }
 
 boolean_t
 vdev_writeable(vdev_t *vd)
 {
-        return (!vdev_is_dead(vd) && !vd->vdev_cant_write &&
-            vdev_is_concrete(vd));
+        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write);
 }
 
 boolean_t
 vdev_allocatable(vdev_t *vd)
 {

@@ -3144,11 +2840,11 @@
          * the proper locks.  Note that we have to get the vdev state
          * in a local variable because although it changes atomically,
          * we're asking two separate questions about it.
          */
         return (!(state < VDEV_STATE_DEGRADED && state != VDEV_STATE_CLOSED) &&
-            !vd->vdev_cant_write && vdev_is_concrete(vd) &&
+            !vd->vdev_cant_write && !vd->vdev_ishole &&
             vd->vdev_mg->mg_initialized);
 }
 
 boolean_t
 vdev_accessible(vdev_t *vd, zio_t *zio)

@@ -3193,12 +2889,11 @@
          */
         if (vd->vdev_aux == NULL && tvd != NULL) {
                 vs->vs_esize = P2ALIGN(vd->vdev_max_asize - vd->vdev_asize -
                     spa->spa_bootsize, 1ULL << tvd->vdev_ms_shift);
         }
-        if (vd->vdev_aux == NULL && vd == vd->vdev_top &&
-            vdev_is_concrete(vd)) {
+        if (vd->vdev_aux == NULL && vd == vd->vdev_top && !vd->vdev_ishole) {
                 vs->vs_fragmentation = vd->vdev_mg->mg_fragmentation;
         }
 
         /*
          * If we're getting stats on the root vdev, aggregate the I/O counts

@@ -3210,10 +2905,12 @@
                         vdev_stat_t *cvs = &cvd->vdev_stat;
 
                         for (int t = 0; t < ZIO_TYPES; t++) {
                                 vs->vs_ops[t] += cvs->vs_ops[t];
                                 vs->vs_bytes[t] += cvs->vs_bytes[t];
+                                vs->vs_iotime[t] += cvs->vs_iotime[t];
+                                vs->vs_latency[t] += cvs->vs_latency[t];
                         }
                         cvs->vs_scan_removing = cvd->vdev_removing;
                 }
         }
         mutex_exit(&vd->vdev_stat_lock);

@@ -3302,10 +2999,24 @@
                 }
 
                 vs->vs_ops[type]++;
                 vs->vs_bytes[type] += psize;
 
+                /*
+                 * Each delta is measured in nanoseconds, but we keep the
+                 * cumulative iotime in microseconds so that it does not
+                 * overflow on a busy system.
+                 */
+                vs->vs_iotime[type] += (zio->io_vd_timestamp) / 1000;
+
+                /*
+                 * Latency is an exponential moving average of iotime deltas
+                 * with a tunable alpha measured in tenths of a percent.
+                 */
+                vs->vs_latency[type] += ((int64_t)zio->io_vd_timestamp -
+                    vs->vs_latency[type]) * zfs_vs_latency_alpha / 1000;
+
                 mutex_exit(&vd->vdev_stat_lock);
                 return;
         }
 
         if (flags & ZIO_FLAG_SPECULATIVE)

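The latency accounting added above is an exponential moving average: each new iotime delta pulls the stored latency toward itself by `alpha / 1000`. The following is a minimal userland sketch of that update rule; the function and variable names are illustrative, not the kernel's (the real code updates `vs->vs_latency[type]` in place under `vdev_stat_lock` using `zfs_vs_latency_alpha`).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the EMA update from vdev_stat_update():
 *     latency += (sample - latency) * alpha / 1000
 * where alpha is expressed in tenths of a percent (e.g. 100 == 10%).
 * An alpha of 0 leaves the average unchanged; an alpha of 1000 makes
 * it track the latest sample exactly.
 */
static int64_t
ema_update(int64_t latency, int64_t sample, int64_t alpha)
{
	return (latency + (sample - latency) * alpha / 1000);
}
```

With alpha at 10% (100 tenths of a percent), a stream of identical samples converges geometrically: each step closes 10% of the remaining gap, so slow devices show up in the average quickly without a single outlier dominating it.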
@@ -3338,12 +3049,22 @@
         }
         if (type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
                 vs->vs_write_errors++;
         mutex_exit(&vd->vdev_stat_lock);
 
-        if (spa->spa_load_state == SPA_LOAD_NONE &&
-            type == ZIO_TYPE_WRITE && txg != 0 &&
+        if ((vd->vdev_isspecial || vd->vdev_isspecial_child) &&
+            (vs->vs_checksum_errors != 0 || vs->vs_read_errors != 0 ||
+            vs->vs_write_errors != 0 || !vdev_readable(vd) ||
+            !vdev_writeable(vd)) && !spa->spa_special_has_errors) {
+                /* all new writes will be placed on the normal class */
+                cmn_err(CE_WARN, "New writes to special vdev [%s] "
+                    "will be stopped", (vd->vdev_path != NULL) ?
+                    vd->vdev_path : "undefined");
+                spa->spa_special_has_errors = B_TRUE;
+        }
+
+        if (type == ZIO_TYPE_WRITE && txg != 0 &&
             (!(flags & ZIO_FLAG_IO_REPAIR) ||
             (flags & ZIO_FLAG_SCAN_THREAD) ||
             spa->spa_claiming)) {
                 /*
                  * This is either a normal write (not a repair), or it's

@@ -3414,11 +3135,11 @@
         vd->vdev_stat.vs_alloc += alloc_delta;
         vd->vdev_stat.vs_space += space_delta;
         vd->vdev_stat.vs_dspace += dspace_delta;
         mutex_exit(&vd->vdev_stat_lock);
 
-        if (mc == spa_normal_class(spa)) {
+        if (mc == spa_normal_class(spa) || mc == spa_special_class(spa)) {
                 mutex_enter(&rvd->vdev_stat_lock);
                 rvd->vdev_stat.vs_alloc += alloc_delta;
                 rvd->vdev_stat.vs_space += space_delta;
                 rvd->vdev_stat.vs_dspace += dspace_delta;
                 mutex_exit(&rvd->vdev_stat_lock);

@@ -3504,14 +3225,13 @@
                         vdev_config_dirty(rvd->vdev_child[c]);
         } else {
                 ASSERT(vd == vd->vdev_top);
 
                 if (!list_link_active(&vd->vdev_config_dirty_node) &&
-                    vdev_is_concrete(vd)) {
+                    !vd->vdev_ishole)
                         list_insert_head(&spa->spa_config_dirty_list, vd);
                 }
-        }
 }
 
 void
 vdev_config_clean(vdev_t *vd)
 {

@@ -3547,12 +3267,11 @@
          */
         ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) ||
             (dsl_pool_sync_context(spa_get_dsl(spa)) &&
             spa_config_held(spa, SCL_STATE, RW_READER)));
 
-        if (!list_link_active(&vd->vdev_state_dirty_node) &&
-            vdev_is_concrete(vd))
+        if (!list_link_active(&vd->vdev_state_dirty_node) && !vd->vdev_ishole)
                 list_insert_head(&spa->spa_state_dirty_list, vd);
 }
 
 void
 vdev_state_clean(vdev_t *vd)

@@ -3582,14 +3301,13 @@
         if (vd->vdev_children > 0) {
                 for (int c = 0; c < vd->vdev_children; c++) {
                         child = vd->vdev_child[c];
 
                         /*
-                         * Don't factor holes or indirect vdevs into the
-                         * decision.
+                         * Don't factor holes into the decision.
                          */
-                        if (!vdev_is_concrete(child))
+                        if (child->vdev_ishole)
                                 continue;
 
                         if (!vdev_readable(child) ||
                             (!vdev_writeable(child) && spa_writeable(spa))) {
                                 /*

@@ -3760,23 +3478,10 @@
 
         if (!isopen && vd->vdev_parent)
                 vdev_propagate_state(vd->vdev_parent);
 }
 
-boolean_t
-vdev_children_are_offline(vdev_t *vd)
-{
-        ASSERT(!vd->vdev_ops->vdev_op_leaf);
-
-        for (uint64_t i = 0; i < vd->vdev_children; i++) {
-                if (vd->vdev_child[i]->vdev_state != VDEV_STATE_OFFLINE)
-                        return (B_FALSE);
-        }
-
-        return (B_TRUE);
-}
-
 /*
  * Check the vdev configuration to ensure that it's capable of supporting
  * a root pool. We do not support partial configuration.
  * In addition, only a single top-level vdev is allowed.
  */

@@ -3787,12 +3492,11 @@
                 char *vdev_type = vd->vdev_ops->vdev_op_type;
 
                 if (strcmp(vdev_type, VDEV_TYPE_ROOT) == 0 &&
                     vd->vdev_children > 1) {
                         return (B_FALSE);
-                } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0 ||
-                    strcmp(vdev_type, VDEV_TYPE_INDIRECT) == 0) {
+                } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0) {
                         return (B_FALSE);
                 }
         }
 
         for (int c = 0; c < vd->vdev_children; c++) {

@@ -3800,19 +3504,36 @@
                         return (B_FALSE);
         }
         return (B_TRUE);
 }
 
-boolean_t
-vdev_is_concrete(vdev_t *vd)
+/*
+ * Load the state from the original vdev tree (ovd) which
+ * we've retrieved from the MOS config object. If the original
+ * vdev was offline or faulted then we transfer that state to the
+ * device in the current vdev tree (nvd).
+ */
+void
+vdev_load_log_state(vdev_t *nvd, vdev_t *ovd)
 {
-        vdev_ops_t *ops = vd->vdev_ops;
-        if (ops == &vdev_indirect_ops || ops == &vdev_hole_ops ||
-            ops == &vdev_missing_ops || ops == &vdev_root_ops) {
-                return (B_FALSE);
-        } else {
-                return (B_TRUE);
+        spa_t *spa = nvd->vdev_spa;
+
+        ASSERT(nvd->vdev_top->vdev_islog);
+        ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
+        ASSERT3U(nvd->vdev_guid, ==, ovd->vdev_guid);
+
+        for (int c = 0; c < nvd->vdev_children; c++)
+                vdev_load_log_state(nvd->vdev_child[c], ovd->vdev_child[c]);
+
+        if (nvd->vdev_ops->vdev_op_leaf) {
+                /*
+                 * Restore the persistent vdev state
+                 */
+                nvd->vdev_offline = ovd->vdev_offline;
+                nvd->vdev_faulted = ovd->vdev_faulted;
+                nvd->vdev_degraded = ovd->vdev_degraded;
+                nvd->vdev_removed = ovd->vdev_removed;
         }
 }
 
 /*
  * Determine if a log device has valid content.  If the vdev was

@@ -3840,14 +3561,11 @@
 vdev_expand(vdev_t *vd, uint64_t txg)
 {
         ASSERT(vd->vdev_top == vd);
         ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL);
 
-        vdev_set_deflate_ratio(vd);
-
-        if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count &&
-            vdev_is_concrete(vd)) {
+        if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) {
                 VERIFY(vdev_metaslab_init(vd, txg) == 0);
                 vdev_config_dirty(vd);
         }
 }
 

@@ -3894,16 +3612,141 @@
                          * the spa_deadman_synctime we panic the system.
                          */
                         fio = avl_first(&vq->vq_active_tree);
                         delta = gethrtime() - fio->io_timestamp;
                         if (delta > spa_deadman_synctime(spa)) {
-                                vdev_dbgmsg(vd, "SLOW IO: zio timestamp "
-                                    "%lluns, delta %lluns, last io %lluns",
-                                    fio->io_timestamp, (u_longlong_t)delta,
+                                zfs_dbgmsg("SLOW IO: zio timestamp %lluns, "
+                                    "delta %lluns, last io %lluns",
+                                    fio->io_timestamp, delta,
                                     vq->vq_io_complete_ts);
                                 fm_panic("I/O to pool '%s' appears to be "
                                     "hung.", spa_name(spa));
                         }
                 }
                 mutex_exit(&vq->vq_lock);
         }
+}
+
+boolean_t
+vdev_type_is_ddt(vdev_t *vd)
+{
+        uint64_t pool;
+
+        if (vd->vdev_l2ad_ddt == 1 &&
+            zfs_ddt_limit_type == DDT_LIMIT_TO_L2ARC) {
+                ASSERT(spa_l2cache_exists(vd->vdev_guid, &pool));
+                ASSERT(vd->vdev_isl2cache);
+                return (B_TRUE);
+        }
+        return (B_FALSE);
+}
+
+/* count leaf vdev(s) under the given vdev */
+uint_t
+vdev_count_leaf_vdevs(vdev_t *vd)
+{
+        uint_t cnt = 0;
+
+        if (vd->vdev_ops->vdev_op_leaf)
+                return (1);
+
+        /* not a leaf vdev, so sum the leaves under each child */
+        for (int c = 0; c < vd->vdev_children; c++)
+                cnt += vdev_count_leaf_vdevs(vd->vdev_child[c]);
+
+        return (cnt);
+}
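The recursion in `vdev_count_leaf_vdevs()` can be modeled with a toy tree: a node with no children counts as one leaf, and an interior node's count is the sum over its children. The struct below is a stand-in for `vdev_t` (the real code tests `vdev_op_leaf` rather than the child count), so treat it as a sketch of the traversal only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for vdev_t: just the parent/child shape. */
typedef struct toy_vdev {
	int nchildren;
	struct toy_vdev **children;
} toy_vdev_t;

/*
 * Mirror of vdev_count_leaf_vdevs(): a childless node is a leaf
 * and contributes 1; otherwise recurse into every child and sum.
 */
static unsigned
toy_count_leaves(const toy_vdev_t *vd)
{
	unsigned cnt = 0;

	if (vd->nchildren == 0)
		return (1);
	for (int c = 0; c < vd->nchildren; c++)
		cnt += toy_count_leaves(vd->children[c]);
	return (cnt);
}
```

For a two-way mirror (one interior node over two disks), the count is 2; called directly on a leaf, it returns 1.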
+
+/*
+ * Implements the per-vdev portion of manual TRIM. The function passes over
+ * all metaslabs on this vdev and performs a metaslab_trim_all on them. It's
+ * also responsible for rate-control if spa_man_trim_rate is non-zero.
+ */
+void
+vdev_man_trim(vdev_trim_info_t *vti)
+{
+        clock_t t = ddi_get_lbolt();
+        spa_t *spa = vti->vti_vdev->vdev_spa;
+        vdev_t *vd = vti->vti_vdev;
+
+        vd->vdev_man_trimming = B_TRUE;
+        vd->vdev_trim_prog = 0;
+
+        spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+        for (uint64_t i = 0; i < vti->vti_vdev->vdev_ms_count &&
+            !spa->spa_man_trim_stop; i++) {
+                uint64_t delta;
+                metaslab_t *msp = vd->vdev_ms[i];
+                zio_t *trim_io = metaslab_trim_all(msp, &delta);
+
+                atomic_add_64(&vd->vdev_trim_prog, msp->ms_size);
+                spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+
+                (void) zio_wait(trim_io);
+
+                /* delay loop to handle fixed-rate trimming */
+                for (;;) {
+                        uint64_t rate = spa->spa_man_trim_rate;
+                        uint64_t sleep_delay;
+
+                        if (rate == 0) {
+                                /* No delay, just update 't' and move on. */
+                                t = ddi_get_lbolt();
+                                break;
+                        }
+
+                        sleep_delay = (delta * hz) / rate;
+                        mutex_enter(&spa->spa_man_trim_lock);
+                        (void) cv_timedwait(&spa->spa_man_trim_update_cv,
+                            &spa->spa_man_trim_lock, t + sleep_delay);
+                        mutex_exit(&spa->spa_man_trim_lock);
+
+                        /* If interrupted, don't try to relock, get out */
+                        if (spa->spa_man_trim_stop)
+                                goto out;
+
+                        /* Timeout passed, move on to the next metaslab. */
+                        if (ddi_get_lbolt() >= t + sleep_delay) {
+                                t += sleep_delay;
+                                break;
+                        }
+                }
+                spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+        }
+        spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+out:
+        vd->vdev_man_trimming = B_FALSE;
+        /*
+         * Ensure we're marked as "completed" even if we've had to stop
+         * before processing all metaslabs.
+         */
+        vd->vdev_trim_prog = vd->vdev_asize;
+
+        ASSERT(vti->vti_done_cb != NULL);
+        vti->vti_done_cb(vti->vti_done_arg);
+
+        kmem_free(vti, sizeof (*vti));
+}
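The fixed-rate throttle in `vdev_man_trim()` turns bytes trimmed into a sleep duration: after trimming `delta` bytes at a target of `rate` bytes per second, the thread should occupy `delta / rate` seconds, which at `hz` clock ticks per second is `(delta * hz) / rate` ticks. A small sketch of that arithmetic, with a helper name of my own choosing (the real code computes `sleep_delay` inline and waits on `spa_man_trim_update_cv`):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ticks to wait after trimming `delta` bytes so that the long-run
 * throughput stays at `rate` bytes/sec, given `hz` ticks per second.
 * A rate of 0 means "unlimited": no delay at all.
 */
static uint64_t
trim_sleep_ticks(uint64_t delta, uint64_t rate, uint64_t hz)
{
	if (rate == 0)
		return (0);
	return ((delta * hz) / rate);
}
```

For example, trimming a 100 MiB metaslab at a 10 MiB/s cap with `hz` of 100 yields 1000 ticks, i.e. a 10-second pacing window per metaslab; re-reading `spa_man_trim_rate` on every loop iteration, as the code above does, lets an administrator retune the rate mid-operation.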
+
+/*
+ * Runs through all metaslabs on the vdev and does their autotrim processing.
+ */
+void
+vdev_auto_trim(vdev_trim_info_t *vti)
+{
+        vdev_t *vd = vti->vti_vdev;
+        spa_t *spa = vd->vdev_spa;
+        uint64_t txg = vti->vti_txg;
+
+        if (vd->vdev_man_trimming)
+                goto out;
+
+        spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+        for (uint64_t i = 0; i < vd->vdev_ms_count; i++)
+                metaslab_auto_trim(vd->vdev_ms[i], txg);
+        spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+out:
+        ASSERT(vti->vti_done_cb != NULL);
+        vti->vti_done_cb(vti->vti_done_arg);
+
+        kmem_free(vti, sizeof (*vti));
 }