NEX-15270 pool clear does not "repair" cache devices
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dmitry Savitsky <dmitry.savitsky@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9940 Appliance requires a reboot after JBOD power failure or disconnecting all SAS cables
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5736 implement autoreplace matching based on FRU slot number
NEX-6200 hot spares are not reactivated after reinserting into enclosure
NEX-9403 need to update FRU for spare and l2cache devices
NEX-9404 remove lofi autoreplace support from syseventd
NEX-9409 hotsparing doesn't work for vdevs without FRU
NEX-9424 zfs`vdev_online() needs better notification about state changes
Portions contributed by: Alek Pinchuk <alek@nexenta.com>
Portions contributed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8206 dtrace helpers leak when cfork() fails
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8507 erroneous check in vdev_type_is_ddt()
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4584 System panic when adding special vdev to a pool that does not support feature flags
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-2846 Enable Automatic/Intelligent Hot Sparing capability
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3729 KRRP changes mess up iostat(1M)
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
NEX-4204 Removing vdev while on-demand trim is ongoing locks up pool
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3212 remove vdev prop object type from dmu.h, p2
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3025 support root pools on EFI labeled disks
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-1142 move rwlock to vdev to protect vdev_tsd, not just the ldi handle.
This way we serialize open/close, yet allow parallel I/O.
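
A minimal sketch of the locking pattern this describes, using the illumos
rwlock API; vdev_tsd_lock and vdev_tsd are the fields this change touches
(see the rw_init() call in the diff below), while the two helpers here are
simplified stand-ins, not the actual implementations:

        /* Open/close mutate vdev_tsd, so they are serialized as writers. */
        static void
        vdev_disk_close_sketch(vdev_t *vd)
        {
                rw_enter(&vd->vdev_tsd_lock, RW_WRITER);
                /* ... release the ldi handle, free and clear vd->vdev_tsd ... */
                rw_exit(&vd->vdev_tsd_lock);
        }

        /* I/O paths only dereference vdev_tsd, so they proceed in parallel. */
        static int
        vdev_disk_io_sketch(vdev_t *vd)
        {
                int error = 0;

                rw_enter(&vd->vdev_tsd_lock, RW_READER);
                if (vd->vdev_tsd == NULL)
                        error = SET_ERROR(ENXIO); /* closed underneath us */
                /* ... else issue the I/O through the ldi handle ... */
                rw_exit(&vd->vdev_tsd_lock);
                return (error);
        }
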
NEX-801 If a block pointer is corrupt, read or write may crash
If a block pointer is corrupt in such a way that the vdev id of one of its
ditto blocks is out of range, zio_vdev_io_start or zio_vdev_io_done
may trip over it and crash.
This changeset takes care of this by treating an invalid vdev as
neither readable nor writeable.
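
The guard this implies can be sketched as follows; these mirror the shape of
vdev_readable()/vdev_writeable() with the NULL check added, but are simplified
stand-ins rather than the exact illumos bodies:

        /*
         * A vdev that could not be looked up (e.g. an out-of-range vdev id
         * from a corrupt block pointer) is treated as neither readable nor
         * writeable, so the zio pipeline fails the I/O instead of
         * dereferencing a bogus vdev and panicking.
         */
        boolean_t
        vdev_readable_sketch(vdev_t *vd)
        {
                return (vd != NULL && !vd->vdev_cant_read);
        }

        boolean_t
        vdev_writeable_sketch(vdev_t *vd)
        {
                return (vd != NULL && !vd->vdev_cant_write);
        }
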
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

--- old/usr/src/uts/common/fs/zfs/vdev.c
+++ new/usr/src/uts/common/fs/zfs/vdev.c
[13 lines elided]
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24      - * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
  25      - * Copyright 2017 Nexenta Systems, Inc.
       24 + * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
       25 + * Copyright 2018 Nexenta Systems, Inc.
  26   26   * Copyright (c) 2014 Integros [integros.com]
  27   27   * Copyright 2016 Toomas Soome <tsoome@me.com>
  28   28   * Copyright 2017 Joyent, Inc.
  29   29   */
  30   30  
  31   31  #include <sys/zfs_context.h>
  32   32  #include <sys/fm/fs/zfs.h>
  33   33  #include <sys/spa.h>
  34   34  #include <sys/spa_impl.h>
  35      -#include <sys/bpobj.h>
  36   35  #include <sys/dmu.h>
  37   36  #include <sys/dmu_tx.h>
  38      -#include <sys/dsl_dir.h>
  39   37  #include <sys/vdev_impl.h>
  40   38  #include <sys/uberblock_impl.h>
  41   39  #include <sys/metaslab.h>
  42   40  #include <sys/metaslab_impl.h>
  43   41  #include <sys/space_map.h>
  44   42  #include <sys/space_reftree.h>
  45   43  #include <sys/zio.h>
  46   44  #include <sys/zap.h>
  47   45  #include <sys/fs/zfs.h>
  48   46  #include <sys/arc.h>
[8 lines elided]
  57   55  static vdev_ops_t *vdev_ops_table[] = {
  58   56          &vdev_root_ops,
  59   57          &vdev_raidz_ops,
  60   58          &vdev_mirror_ops,
  61   59          &vdev_replacing_ops,
  62   60          &vdev_spare_ops,
  63   61          &vdev_disk_ops,
  64   62          &vdev_file_ops,
  65   63          &vdev_missing_ops,
  66   64          &vdev_hole_ops,
  67      -        &vdev_indirect_ops,
  68   65          NULL
  69   66  };
  70   67  
  71   68  /* maximum scrub/resilver I/O queue per leaf vdev */
  72   69  int zfs_scrub_limit = 10;
  73   70  
  74   71  /*
       72 + * alpha for exponential moving average of I/O latency (in 1/10th of a percent)
       73 + */
       74 +int zfs_vs_latency_alpha = 100;
       75 +
       76 +/*
  75   77   * When a vdev is added, it will be divided into approximately (but no
  76   78   * more than) this number of metaslabs.
  77   79   */
  78   80  int metaslabs_per_vdev = 200;
  79   81  
  80      -boolean_t vdev_validate_skip = B_FALSE;
  81      -
  82      -/*PRINTFLIKE2*/
  83      -void
  84      -vdev_dbgmsg(vdev_t *vd, const char *fmt, ...)
  85      -{
  86      -        va_list adx;
  87      -        char buf[256];
  88      -
  89      -        va_start(adx, fmt);
  90      -        (void) vsnprintf(buf, sizeof (buf), fmt, adx);
  91      -        va_end(adx);
  92      -
  93      -        if (vd->vdev_path != NULL) {
  94      -                zfs_dbgmsg("%s vdev '%s': %s", vd->vdev_ops->vdev_op_type,
  95      -                    vd->vdev_path, buf);
  96      -        } else {
  97      -                zfs_dbgmsg("%s-%llu vdev (guid %llu): %s",
  98      -                    vd->vdev_ops->vdev_op_type,
  99      -                    (u_longlong_t)vd->vdev_id,
 100      -                    (u_longlong_t)vd->vdev_guid, buf);
 101      -        }
 102      -}
 103      -
 104      -void
 105      -vdev_dbgmsg_print_tree(vdev_t *vd, int indent)
 106      -{
 107      -        char state[20];
 108      -
 109      -        if (vd->vdev_ishole || vd->vdev_ops == &vdev_missing_ops) {
 110      -                zfs_dbgmsg("%*svdev %u: %s", indent, "", vd->vdev_id,
 111      -                    vd->vdev_ops->vdev_op_type);
 112      -                return;
 113      -        }
 114      -
 115      -        switch (vd->vdev_state) {
 116      -        case VDEV_STATE_UNKNOWN:
 117      -                (void) snprintf(state, sizeof (state), "unknown");
 118      -                break;
 119      -        case VDEV_STATE_CLOSED:
 120      -                (void) snprintf(state, sizeof (state), "closed");
 121      -                break;
 122      -        case VDEV_STATE_OFFLINE:
 123      -                (void) snprintf(state, sizeof (state), "offline");
 124      -                break;
 125      -        case VDEV_STATE_REMOVED:
 126      -                (void) snprintf(state, sizeof (state), "removed");
 127      -                break;
 128      -        case VDEV_STATE_CANT_OPEN:
 129      -                (void) snprintf(state, sizeof (state), "can't open");
 130      -                break;
 131      -        case VDEV_STATE_FAULTED:
 132      -                (void) snprintf(state, sizeof (state), "faulted");
 133      -                break;
 134      -        case VDEV_STATE_DEGRADED:
 135      -                (void) snprintf(state, sizeof (state), "degraded");
 136      -                break;
 137      -        case VDEV_STATE_HEALTHY:
 138      -                (void) snprintf(state, sizeof (state), "healthy");
 139      -                break;
 140      -        default:
 141      -                (void) snprintf(state, sizeof (state), "<state %u>",
 142      -                    (uint_t)vd->vdev_state);
 143      -        }
 144      -
 145      -        zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent,
 146      -            "", vd->vdev_id, vd->vdev_ops->vdev_op_type,
 147      -            vd->vdev_islog ? " (log)" : "",
 148      -            (u_longlong_t)vd->vdev_guid,
 149      -            vd->vdev_path ? vd->vdev_path : "N/A", state);
 150      -
 151      -        for (uint64_t i = 0; i < vd->vdev_children; i++)
 152      -                vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2);
 153      -}
 154      -
 155   82  /*
 156   83   * Given a vdev type, return the appropriate ops vector.
 157   84   */
 158   85  static vdev_ops_t *
 159   86  vdev_getops(const char *type)
 160   87  {
 161   88          vdev_ops_t *ops, **opspp;
 162   89  
 163   90          for (opspp = vdev_ops_table; (ops = *opspp) != NULL; opspp++)
 164   91                  if (strcmp(ops->vdev_op_type, type) == 0)
 165   92                          break;
 166   93  
 167   94          return (ops);
 168   95  }
 169   96  
       97 +boolean_t
       98 +vdev_is_special(vdev_t *vd)
       99 +{
      100 +        return (vd ? vd->vdev_isspecial : B_FALSE);
      101 +}
      102 +
 170  103  /*
 171  104   * Default asize function: return the MAX of psize with the asize of
 172  105   * all children.  This is what's used by anything other than RAID-Z.
 173  106   */
 174  107  uint64_t
 175  108  vdev_default_asize(vdev_t *vd, uint64_t psize)
 176  109  {
 177  110          uint64_t asize = P2ROUNDUP(psize, 1ULL << vd->vdev_top->vdev_ashift);
 178  111          uint64_t csize;
 179  112  
[125 lines elided]
 305  238  
 306  239          newchild = kmem_zalloc(newsize, KM_SLEEP);
 307  240          if (pvd->vdev_child != NULL) {
 308  241                  bcopy(pvd->vdev_child, newchild, oldsize);
 309  242                  kmem_free(pvd->vdev_child, oldsize);
 310  243          }
 311  244  
 312  245          pvd->vdev_child = newchild;
 313  246          pvd->vdev_child[id] = cvd;
 314  247  
      248 +        cvd->vdev_isspecial_child =
      249 +            (pvd->vdev_isspecial || pvd->vdev_isspecial_child);
      250 +
 315  251          cvd->vdev_top = (pvd->vdev_top ? pvd->vdev_top: cvd);
 316  252          ASSERT(cvd->vdev_top->vdev_parent->vdev_parent == NULL);
 317  253  
 318  254          /*
 319  255           * Walk up all ancestors to update guid sum.
 320  256           */
 321  257          for (; pvd != NULL; pvd = pvd->vdev_parent)
 322  258                  pvd->vdev_guid_sum += cvd->vdev_guid_sum;
 323  259  }
 324  260  
[61 lines elided]
 386  322          pvd->vdev_children = newc;
 387  323  }
 388  324  
 389  325  /*
 390  326   * Allocate and minimally initialize a vdev_t.
 391  327   */
 392  328  vdev_t *
 393  329  vdev_alloc_common(spa_t *spa, uint_t id, uint64_t guid, vdev_ops_t *ops)
 394  330  {
 395  331          vdev_t *vd;
 396      -        vdev_indirect_config_t *vic;
 397  332  
 398  333          vd = kmem_zalloc(sizeof (vdev_t), KM_SLEEP);
 399      -        vic = &vd->vdev_indirect_config;
 400  334  
 401  335          if (spa->spa_root_vdev == NULL) {
 402  336                  ASSERT(ops == &vdev_root_ops);
 403  337                  spa->spa_root_vdev = vd;
 404  338                  spa->spa_load_guid = spa_generate_guid(NULL);
 405  339          }
 406  340  
 407  341          if (guid == 0 && ops != &vdev_hole_ops) {
 408  342                  if (spa->spa_root_vdev == vd) {
 409  343                          /*
[10 lines elided]
 420  354                  ASSERT(!spa_guid_exists(spa_guid(spa), guid));
 421  355          }
 422  356  
 423  357          vd->vdev_spa = spa;
 424  358          vd->vdev_id = id;
 425  359          vd->vdev_guid = guid;
 426  360          vd->vdev_guid_sum = guid;
 427  361          vd->vdev_ops = ops;
 428  362          vd->vdev_state = VDEV_STATE_CLOSED;
 429  363          vd->vdev_ishole = (ops == &vdev_hole_ops);
 430      -        vic->vic_prev_indirect_vdev = UINT64_MAX;
 431  364  
 432      -        rw_init(&vd->vdev_indirect_rwlock, NULL, RW_DEFAULT, NULL);
 433      -        mutex_init(&vd->vdev_obsolete_lock, NULL, MUTEX_DEFAULT, NULL);
 434      -        vd->vdev_obsolete_segments = range_tree_create(NULL, NULL);
 435      -
 436  365          mutex_init(&vd->vdev_dtl_lock, NULL, MUTEX_DEFAULT, NULL);
 437  366          mutex_init(&vd->vdev_stat_lock, NULL, MUTEX_DEFAULT, NULL);
 438  367          mutex_init(&vd->vdev_probe_lock, NULL, MUTEX_DEFAULT, NULL);
 439      -        mutex_init(&vd->vdev_queue_lock, NULL, MUTEX_DEFAULT, NULL);
      368 +        mutex_init(&vd->vdev_scan_io_queue_lock, NULL, MUTEX_DEFAULT, NULL);
      369 +        rw_init(&vd->vdev_tsd_lock, NULL, RW_DEFAULT, NULL);
 440  370          for (int t = 0; t < DTL_TYPES; t++) {
 441      -                vd->vdev_dtl[t] = range_tree_create(NULL, NULL);
      371 +                vd->vdev_dtl[t] = range_tree_create(NULL, NULL,
      372 +                    &vd->vdev_dtl_lock);
 442  373          }
 443  374          txg_list_create(&vd->vdev_ms_list, spa,
 444  375              offsetof(struct metaslab, ms_txg_node));
 445  376          txg_list_create(&vd->vdev_dtl_list, spa,
 446  377              offsetof(struct vdev, vdev_dtl_node));
 447  378          vd->vdev_stat.vs_timestamp = gethrtime();
 448  379          vdev_queue_init(vd);
 449  380          vdev_cache_init(vd);
 450  381  
 451  382          return (vd);
[3 lines elided]
 455  386   * Allocate a new vdev.  The 'alloctype' is used to control whether we are
 456  387   * creating a new vdev or loading an existing one - the behavior is slightly
 457  388   * different for each case.
 458  389   */
 459  390  int
 460  391  vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id,
 461  392      int alloctype)
 462  393  {
 463  394          vdev_ops_t *ops;
 464  395          char *type;
 465      -        uint64_t guid = 0, islog, nparity;
      396 +        uint64_t guid = 0, nparity;
      397 +        uint64_t isspecial = 0, islog = 0;
 466  398          vdev_t *vd;
 467      -        vdev_indirect_config_t *vic;
 468  399  
 469  400          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
 470  401  
 471  402          if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0)
 472  403                  return (SET_ERROR(EINVAL));
 473  404  
 474  405          if ((ops = vdev_getops(type)) == NULL)
 475  406                  return (SET_ERROR(EINVAL));
 476  407  
 477  408          /*
[22 lines elided]
 500  431  
 501  432          /*
 502  433           * The first allocated vdev must be of type 'root'.
 503  434           */
 504  435          if (ops != &vdev_root_ops && spa->spa_root_vdev == NULL)
 505  436                  return (SET_ERROR(EINVAL));
 506  437  
 507  438          /*
 508  439           * Determine whether we're a log vdev.
 509  440           */
 510      -        islog = 0;
 511  441          (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_LOG, &islog);
 512  442          if (islog && spa_version(spa) < SPA_VERSION_SLOGS)
 513  443                  return (SET_ERROR(ENOTSUP));
 514  444  
      445 +        /*
      446 +         * Determine whether we're a special vdev.
      447 +         */
      448 +        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_SPECIAL, &isspecial);
      449 +        if (isspecial && spa_version(spa) < SPA_VERSION_FEATURES)
      450 +                return (SET_ERROR(ENOTSUP));
      451 +
 515  452          if (ops == &vdev_hole_ops && spa_version(spa) < SPA_VERSION_HOLES)
 516  453                  return (SET_ERROR(ENOTSUP));
 517  454  
 518  455          /*
 519  456           * Set the nparity property for RAID-Z vdevs.
 520  457           */
 521  458          nparity = -1ULL;
 522  459          if (ops == &vdev_raidz_ops) {
 523  460                  if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NPARITY,
 524  461                      &nparity) == 0) {
[20 lines elided]
 545  482                           * Otherwise, we default to 1 parity device for RAID-Z.
 546  483                           */
 547  484                          nparity = 1;
 548  485                  }
 549  486          } else {
 550  487                  nparity = 0;
 551  488          }
 552  489          ASSERT(nparity != -1ULL);
 553  490  
 554  491          vd = vdev_alloc_common(spa, id, guid, ops);
 555      -        vic = &vd->vdev_indirect_config;
 556  492  
 557  493          vd->vdev_islog = islog;
      494 +        vd->vdev_isspecial = isspecial;
 558  495          vd->vdev_nparity = nparity;
      496 +        vd->vdev_isspecial_child = (parent != NULL &&
      497 +            (parent->vdev_isspecial || parent->vdev_isspecial_child));
 559  498  
 560  499          if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &vd->vdev_path) == 0)
 561  500                  vd->vdev_path = spa_strdup(vd->vdev_path);
 562  501          if (nvlist_lookup_string(nv, ZPOOL_CONFIG_DEVID, &vd->vdev_devid) == 0)
 563  502                  vd->vdev_devid = spa_strdup(vd->vdev_devid);
 564  503          if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PHYS_PATH,
 565  504              &vd->vdev_physpath) == 0)
 566  505                  vd->vdev_physpath = spa_strdup(vd->vdev_physpath);
 567  506          if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &vd->vdev_fru) == 0)
 568  507                  vd->vdev_fru = spa_strdup(vd->vdev_fru);
 569  508  
      509 +#ifdef _KERNEL
      510 +        if (vd->vdev_path) {
      511 +                char dev_path[MAXPATHLEN];
      512 +                char *last_slash = NULL;
      513 +                kstat_t *exist = NULL;
      514 +
      515 +                if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) == 0)
      516 +                        last_slash = strrchr(vd->vdev_path, '/');
      517 +
      518 +                (void) sprintf(dev_path, "%s:%s", spa->spa_name,
      519 +                    last_slash != NULL ? last_slash + 1 : vd->vdev_path);
      520 +
      521 +                exist = kstat_hold_byname("zfs", 0, dev_path, ALL_ZONES);
      522 +
      523 +                if (!exist) {
      524 +                        vd->vdev_iokstat = kstat_create("zfs", 0, dev_path,
      525 +                            "zfs", KSTAT_TYPE_IO, 1, 0);
      526 +
      527 +                        if (vd->vdev_iokstat) {
      528 +                                vd->vdev_iokstat->ks_lock =
      529 +                                    &spa->spa_iokstat_lock;
      530 +                                kstat_install(vd->vdev_iokstat);
      531 +                        }
      532 +                } else {
      533 +                        kstat_rele(exist);
      534 +                }
      535 +        }
      536 +#endif
      537 +
 570  538          /*
 571  539           * Set the whole_disk property.  If it's not specified, leave the value
 572  540           * as -1.
 573  541           */
 574  542          if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
 575  543              &vd->vdev_wholedisk) != 0)
 576  544                  vd->vdev_wholedisk = -1ULL;
 577  545  
 578      -        ASSERT0(vic->vic_mapping_object);
 579      -        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT,
 580      -            &vic->vic_mapping_object);
 581      -        ASSERT0(vic->vic_births_object);
 582      -        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS,
 583      -            &vic->vic_births_object);
 584      -        ASSERT3U(vic->vic_prev_indirect_vdev, ==, UINT64_MAX);
 585      -        (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV,
 586      -            &vic->vic_prev_indirect_vdev);
      546 +        /*
      547 +         * Set the is_ssd property.  If it's not specified it means the media
      548 +         * is not SSD or the request failed and we assume it's not.
      549 +         */
      550 +        if (nvlist_lookup_boolean(nv, ZPOOL_CONFIG_IS_SSD) == 0)
      551 +                vd->vdev_is_ssd = B_TRUE;
      552 +        else
      553 +                vd->vdev_is_ssd = B_FALSE;
 587  554  
 588  555          /*
 589  556           * Look for the 'not present' flag.  This will only be set if the device
 590  557           * was not present at the time of import.
 591  558           */
 592  559          (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT,
 593  560              &vd->vdev_not_present);
 594  561  
 595  562          /*
 596  563           * Get the alignment requirement.
[19 lines elided]
 616  583                      &vd->vdev_asize);
 617  584                  (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_REMOVING,
 618  585                      &vd->vdev_removing);
 619  586                  (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_VDEV_TOP_ZAP,
 620  587                      &vd->vdev_top_zap);
 621  588          } else {
 622  589                  ASSERT0(vd->vdev_top_zap);
 623  590          }
 624  591  
 625  592          if (parent && !parent->vdev_parent && alloctype != VDEV_ALLOC_ATTACH) {
      593 +                metaslab_class_t *mc = isspecial ? spa_special_class(spa) :
      594 +                    (islog ? spa_log_class(spa) : spa_normal_class(spa));
      595 +
 626  596                  ASSERT(alloctype == VDEV_ALLOC_LOAD ||
 627  597                      alloctype == VDEV_ALLOC_ADD ||
 628  598                      alloctype == VDEV_ALLOC_SPLIT ||
 629  599                      alloctype == VDEV_ALLOC_ROOTPOOL);
 630      -                vd->vdev_mg = metaslab_group_create(islog ?
 631      -                    spa_log_class(spa) : spa_normal_class(spa), vd);
      600 +
      601 +                vd->vdev_mg = metaslab_group_create(mc, vd);
 632  602          }
 633  603  
 634  604          if (vd->vdev_ops->vdev_op_leaf &&
 635  605              (alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_SPLIT)) {
 636  606                  (void) nvlist_lookup_uint64(nv,
 637  607                      ZPOOL_CONFIG_VDEV_LEAF_ZAP, &vd->vdev_leaf_zap);
 638  608          } else {
 639  609                  ASSERT0(vd->vdev_leaf_zap);
 640  610          }
 641  611  
[61 lines elided]
 703  673  
 704  674          return (0);
 705  675  }
 706  676  
 707  677  void
 708  678  vdev_free(vdev_t *vd)
 709  679  {
 710  680          spa_t *spa = vd->vdev_spa;
 711  681  
 712  682          /*
      683 +         * Scan queues are normally destroyed at the end of a scan. If the
      684 +         * queue exists here, that implies the vdev is being removed while
      685 +         * the scan is still running.
      686 +         */
      687 +        if (vd->vdev_scan_io_queue != NULL) {
      688 +                dsl_scan_io_queue_destroy(vd->vdev_scan_io_queue);
      689 +                vd->vdev_scan_io_queue = NULL;
      690 +        }
      691 +
      692 +        /*
 713  693           * vdev_free() implies closing the vdev first.  This is simpler than
 714  694           * trying to ensure complicated semantics for all callers.
 715  695           */
 716  696          vdev_close(vd);
 717  697  
 718  698          ASSERT(!list_link_active(&vd->vdev_config_dirty_node));
 719  699          ASSERT(!list_link_active(&vd->vdev_state_dirty_node));
 720  700  
 721  701          /*
 722  702           * Free all children.
[47 lines elided]
 770  750          txg_list_destroy(&vd->vdev_dtl_list);
 771  751  
 772  752          mutex_enter(&vd->vdev_dtl_lock);
 773  753          space_map_close(vd->vdev_dtl_sm);
 774  754          for (int t = 0; t < DTL_TYPES; t++) {
 775  755                  range_tree_vacate(vd->vdev_dtl[t], NULL, NULL);
 776  756                  range_tree_destroy(vd->vdev_dtl[t]);
 777  757          }
 778  758          mutex_exit(&vd->vdev_dtl_lock);
 779  759  
 780      -        EQUIV(vd->vdev_indirect_births != NULL,
 781      -            vd->vdev_indirect_mapping != NULL);
 782      -        if (vd->vdev_indirect_births != NULL) {
 783      -                vdev_indirect_mapping_close(vd->vdev_indirect_mapping);
 784      -                vdev_indirect_births_close(vd->vdev_indirect_births);
      760 +        if (vd->vdev_iokstat) {
      761 +                kstat_delete(vd->vdev_iokstat);
      762 +                vd->vdev_iokstat = NULL;
 785  763          }
 786      -
 787      -        if (vd->vdev_obsolete_sm != NULL) {
 788      -                ASSERT(vd->vdev_removing ||
 789      -                    vd->vdev_ops == &vdev_indirect_ops);
 790      -                space_map_close(vd->vdev_obsolete_sm);
 791      -                vd->vdev_obsolete_sm = NULL;
 792      -        }
 793      -        range_tree_destroy(vd->vdev_obsolete_segments);
 794      -        rw_destroy(&vd->vdev_indirect_rwlock);
 795      -        mutex_destroy(&vd->vdev_obsolete_lock);
 796      -
 797      -        mutex_destroy(&vd->vdev_queue_lock);
 798  764          mutex_destroy(&vd->vdev_dtl_lock);
 799  765          mutex_destroy(&vd->vdev_stat_lock);
 800  766          mutex_destroy(&vd->vdev_probe_lock);
      767 +        mutex_destroy(&vd->vdev_scan_io_queue_lock);
      768 +        rw_destroy(&vd->vdev_tsd_lock);
 801  769  
 802  770          if (vd == spa->spa_root_vdev)
 803  771                  spa->spa_root_vdev = NULL;
 804  772  
      773 +        ASSERT3P(vd->vdev_scan_io_queue, ==, NULL);
      774 +
 805  775          kmem_free(vd, sizeof (vdev_t));
 806  776  }
 807  777  
 808  778  /*
 809  779   * Transfer top-level vdev state from svd to tvd.
 810  780   */
 811  781  static void
 812  782  vdev_top_transfer(vdev_t *svd, vdev_t *tvd)
 813  783  {
 814  784          spa_t *spa = svd->vdev_spa;
[49 lines elided]
 864  834          if (list_link_active(&svd->vdev_state_dirty_node)) {
 865  835                  vdev_state_clean(svd);
 866  836                  vdev_state_dirty(tvd);
 867  837          }
 868  838  
 869  839          tvd->vdev_deflate_ratio = svd->vdev_deflate_ratio;
 870  840          svd->vdev_deflate_ratio = 0;
 871  841  
 872  842          tvd->vdev_islog = svd->vdev_islog;
 873  843          svd->vdev_islog = 0;
      844 +
      845 +        tvd->vdev_isspecial = svd->vdev_isspecial;
      846 +        svd->vdev_isspecial = 0;
      847 +        svd->vdev_isspecial_child = tvd->vdev_isspecial;
      848 +
      849 +        dsl_scan_io_queue_vdev_xfer(svd, tvd);
 874  850  }
 875  851  
 876  852  static void
 877  853  vdev_top_update(vdev_t *tvd, vdev_t *vd)
 878  854  {
 879  855          if (vd == NULL)
 880  856                  return;
 881  857  
 882  858          vd->vdev_top = tvd;
 883  859  
[11 lines elided]
 895  871          vdev_t *pvd = cvd->vdev_parent;
 896  872          vdev_t *mvd;
 897  873  
 898  874          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
 899  875  
 900  876          mvd = vdev_alloc_common(spa, cvd->vdev_id, 0, ops);
 901  877  
 902  878          mvd->vdev_asize = cvd->vdev_asize;
 903  879          mvd->vdev_min_asize = cvd->vdev_min_asize;
 904  880          mvd->vdev_max_asize = cvd->vdev_max_asize;
 905      -        mvd->vdev_psize = cvd->vdev_psize;
 906  881          mvd->vdev_ashift = cvd->vdev_ashift;
 907  882          mvd->vdev_state = cvd->vdev_state;
 908  883          mvd->vdev_crtxg = cvd->vdev_crtxg;
 909  884  
 910  885          vdev_remove_child(pvd, cvd);
 911  886          vdev_add_child(pvd, mvd);
 912  887          cvd->vdev_id = mvd->vdev_children;
 913  888          vdev_add_child(mvd, cvd);
 914  889          vdev_top_update(cvd->vdev_top, cvd->vdev_top);
 915  890  
[60 lines elided]
 976  951          ASSERT(txg == 0 || spa_config_held(spa, SCL_ALLOC, RW_WRITER));
 977  952  
 978  953          /*
 979  954           * This vdev is not being allocated from yet or is a hole.
 980  955           */
 981  956          if (vd->vdev_ms_shift == 0)
 982  957                  return (0);
 983  958  
 984  959          ASSERT(!vd->vdev_ishole);
 985  960  
      961 +        /*
      962 +         * Compute the raidz-deflation ratio.  Note, we hard-code
      963 +         * in 128k (1 << 17) because it is the "typical" blocksize.
      964 +         * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
      965 +         * otherwise it would inconsistently account for existing bp's.
      966 +         */
      967 +        vd->vdev_deflate_ratio = (1 << 17) /
      968 +            (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
      969 +
 986  970          ASSERT(oldc <= newc);
 987  971  
 988  972          mspp = kmem_zalloc(newc * sizeof (*mspp), KM_SLEEP);
 989  973  
 990  974          if (oldc != 0) {
 991  975                  bcopy(vd->vdev_ms, mspp, oldc * sizeof (*mspp));
 992  976                  kmem_free(vd->vdev_ms, oldc * sizeof (*mspp));
 993  977          }
 994  978  
 995  979          vd->vdev_ms = mspp;
 996  980          vd->vdev_ms_count = newc;
 997  981  
 998  982          for (m = oldc; m < newc; m++) {
 999  983                  uint64_t object = 0;
1000  984  
1001      -                /*
1002      -                 * vdev_ms_array may be 0 if we are creating the "fake"
1003      -                 * metaslabs for an indirect vdev for zdb's leak detection.
1004      -                 * See zdb_leak_init().
1005      -                 */
1006      -                if (txg == 0 && vd->vdev_ms_array != 0) {
      985 +                if (txg == 0) {
1007  986                          error = dmu_read(mos, vd->vdev_ms_array,
1008  987                              m * sizeof (uint64_t), sizeof (uint64_t), &object,
1009  988                              DMU_READ_PREFETCH);
1010      -                        if (error != 0) {
1011      -                                vdev_dbgmsg(vd, "unable to read the metaslab "
1012      -                                    "array [error=%d]", error);
      989 +                        if (error)
1013  990                                  return (error);
1014      -                        }
1015  991                  }
1016  992  
1017  993                  error = metaslab_init(vd->vdev_mg, m, object, txg,
1018  994                      &(vd->vdev_ms[m]));
1019      -                if (error != 0) {
1020      -                        vdev_dbgmsg(vd, "metaslab_init failed [error=%d]",
1021      -                            error);
      995 +                if (error)
1022  996                          return (error);
1023      -                }
1024  997          }
1025  998  
1026  999          if (txg == 0)
1027 1000                  spa_config_enter(spa, SCL_ALLOC, FTAG, RW_WRITER);
1028 1001  
1029 1002          /*
1030 1003           * If the vdev is being removed we don't activate
1031 1004           * the metaslabs since we want to ensure that no new
1032 1005           * allocations are performed on this device.
1033 1006           */
[2 lines elided]
1036 1009  
1037 1010          if (txg == 0)
1038 1011                  spa_config_exit(spa, SCL_ALLOC, FTAG);
1039 1012  
1040 1013          return (0);
1041 1014  }
1042 1015  
1043 1016  void
1044 1017  vdev_metaslab_fini(vdev_t *vd)
1045 1018  {
1046      -        if (vd->vdev_ms != NULL) {
1047      -                uint64_t count = vd->vdev_ms_count;
     1019 +        uint64_t m;
     1020 +        uint64_t count = vd->vdev_ms_count;
1048 1021  
     1022 +        if (vd->vdev_ms != NULL) {
1049 1023                  metaslab_group_passivate(vd->vdev_mg);
1050      -                for (uint64_t m = 0; m < count; m++) {
     1024 +                for (m = 0; m < count; m++) {
1051 1025                          metaslab_t *msp = vd->vdev_ms[m];
1052 1026  
1053 1027                          if (msp != NULL)
1054 1028                                  metaslab_fini(msp);
1055 1029                  }
1056 1030                  kmem_free(vd->vdev_ms, count * sizeof (metaslab_t *));
1057 1031                  vd->vdev_ms = NULL;
1058      -
1059      -                vd->vdev_ms_count = 0;
1060 1032          }
1061      -        ASSERT0(vd->vdev_ms_count);
1062 1033  }
1063 1034  
1064 1035  typedef struct vdev_probe_stats {
1065 1036          boolean_t       vps_readable;
1066 1037          boolean_t       vps_writeable;
1067 1038          int             vps_flags;
1068 1039  } vdev_probe_stats_t;
1069 1040  
1070 1041  static void
1071 1042  vdev_probe_done(zio_t *zio)
[23 lines elided]
1095 1066                  zio_t *pio;
1096 1067  
1097 1068                  vd->vdev_cant_read |= !vps->vps_readable;
1098 1069                  vd->vdev_cant_write |= !vps->vps_writeable;
1099 1070  
1100 1071                  if (vdev_readable(vd) &&
1101 1072                      (vdev_writeable(vd) || !spa_writeable(spa))) {
1102 1073                          zio->io_error = 0;
1103 1074                  } else {
1104 1075                          ASSERT(zio->io_error != 0);
1105      -                        vdev_dbgmsg(vd, "failed probe");
1106 1076                          zfs_ereport_post(FM_EREPORT_ZFS_PROBE_FAILURE,
1107 1077                              spa, vd, NULL, 0, 0);
1108 1078                          zio->io_error = SET_ERROR(ENXIO);
1109 1079                  }
1110 1080  
1111 1081                  mutex_enter(&vd->vdev_probe_lock);
1112 1082                  ASSERT(vd->vdev_probe_zio == zio);
1113 1083                  vd->vdev_probe_zio = NULL;
1114 1084                  mutex_exit(&vd->vdev_probe_lock);
1115 1085  
[147 lines elided]
1263 1233              children, children, TASKQ_PREPOPULATE);
1264 1234  
1265 1235          for (int c = 0; c < children; c++)
1266 1236                  VERIFY(taskq_dispatch(tq, vdev_open_child, vd->vdev_child[c],
1267 1237                      TQ_SLEEP) != NULL);
1268 1238  
1269 1239          taskq_destroy(tq);
1270 1240  }
1271 1241  
1272 1242  /*
1273      - * Compute the raidz-deflation ratio.  Note, we hard-code
1274      - * in 128k (1 << 17) because it is the "typical" blocksize.
1275      - * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
1276      - * otherwise it would inconsistently account for existing bp's.
1277      - */
1278      -static void
1279      -vdev_set_deflate_ratio(vdev_t *vd)
1280      -{
1281      -        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
1282      -                vd->vdev_deflate_ratio = (1 << 17) /
1283      -                    (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
1284      -        }
1285      -}
1286      -
1287      -/*
1288 1243   * Prepare a virtual device for access.
1289 1244   */
1290 1245  int
1291 1246  vdev_open(vdev_t *vd)
1292 1247  {
1293 1248          spa_t *spa = vd->vdev_spa;
1294 1249          int error;
1295 1250          uint64_t osize = 0;
1296 1251          uint64_t max_osize = 0;
1297 1252          uint64_t asize, max_asize, psize;
[4 lines elided]
1302 1257          ASSERT(vd->vdev_state == VDEV_STATE_CLOSED ||
1303 1258              vd->vdev_state == VDEV_STATE_CANT_OPEN ||
1304 1259              vd->vdev_state == VDEV_STATE_OFFLINE);
1305 1260  
1306 1261          vd->vdev_stat.vs_aux = VDEV_AUX_NONE;
1307 1262          vd->vdev_cant_read = B_FALSE;
1308 1263          vd->vdev_cant_write = B_FALSE;
1309 1264          vd->vdev_min_asize = vdev_get_min_asize(vd);
1310 1265  
1311 1266          /*
1312      -         * If this vdev is not removed, check its fault status.  If it's
1313      -         * faulted, bail out of the open.
     1267 +         * If vdev isn't removed and is faulted for reasons other than failed
     1268 +         * open, or if it's offline - bail out.
1314 1269           */
1315      -        if (!vd->vdev_removed && vd->vdev_faulted) {
     1270 +        if (!vd->vdev_removed && vd->vdev_faulted &&
     1271 +            vd->vdev_label_aux != VDEV_AUX_OPEN_FAILED) {
1316 1272                  ASSERT(vd->vdev_children == 0);
1317 1273                  ASSERT(vd->vdev_label_aux == VDEV_AUX_ERR_EXCEEDED ||
1318 1274                      vd->vdev_label_aux == VDEV_AUX_EXTERNAL);
1319 1275                  vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED,
1320 1276                      vd->vdev_label_aux);
1321 1277                  return (SET_ERROR(ENXIO));
1322 1278          } else if (vd->vdev_offline) {
1323 1279                  ASSERT(vd->vdev_children == 0);
1324 1280                  vdev_set_state(vd, B_TRUE, VDEV_STATE_OFFLINE, VDEV_AUX_NONE);
1325 1281                  return (SET_ERROR(ENXIO));
[7 lines elided]
1333 1289           */
1334 1290          vd->vdev_reopening = B_FALSE;
1335 1291          if (zio_injection_enabled && error == 0)
1336 1292                  error = zio_handle_device_injection(vd, NULL, ENXIO);
1337 1293  
1338 1294          if (error) {
1339 1295                  if (vd->vdev_removed &&
1340 1296                      vd->vdev_stat.vs_aux != VDEV_AUX_OPEN_FAILED)
1341 1297                          vd->vdev_removed = B_FALSE;
1342 1298  
1343      -                if (vd->vdev_stat.vs_aux == VDEV_AUX_CHILDREN_OFFLINE) {
1344      -                        vdev_set_state(vd, B_TRUE, VDEV_STATE_OFFLINE,
1345      -                            vd->vdev_stat.vs_aux);
1346      -                } else {
1347      -                        vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
1348      -                            vd->vdev_stat.vs_aux);
1349      -                }
     1299 +                vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
     1300 +                    vd->vdev_stat.vs_aux);
1350 1301                  return (error);
1351 1302          }
1352 1303  
1353 1304          vd->vdev_removed = B_FALSE;
1354 1305  
1355 1306          /*
1356 1307           * Recheck the faulted flag now that we have confirmed that
1357 1308           * the vdev is accessible.  If we're faulted, bail.
1358 1309           */
1359 1310          if (vd->vdev_faulted) {
[136 lines elided]
1496 1447           */
1497 1448          if (vd->vdev_ops->vdev_op_leaf && !spa->spa_scrub_reopen &&
1498 1449              vdev_resilver_needed(vd, NULL, NULL))
1499 1450                  spa_async_request(spa, SPA_ASYNC_RESILVER);
1500 1451  
1501 1452          return (0);
1502 1453  }
1503 1454  
1504 1455  /*
1505 1456   * Called once the vdevs are all opened, this routine validates the label
1506      - * contents. This needs to be done before vdev_load() so that we don't
     1457 + * contents.  This needs to be done before vdev_load() so that we don't
1507 1458   * inadvertently do repair I/Os to the wrong device.
1508 1459   *
     1460 + * If 'strict' is false ignore the spa guid check. This is necessary because
     1461 + * if the machine crashed during a re-guid the new guid might have been written
     1462 + * to all of the vdev labels, but not the cached config. The strict check
     1463 + * will be performed when the pool is opened again using the mos config.
     1464 + *
1509 1465   * This function will only return failure if one of the vdevs indicates that it
1510 1466   * has since been destroyed or exported.  This is only possible if
1511 1467   * /etc/zfs/zpool.cache was readonly at the time.  Otherwise, the vdev state
1512 1468   * will be updated but the function will return 0.
1513 1469   */
1514 1470  int
1515      -vdev_validate(vdev_t *vd)
     1471 +vdev_validate(vdev_t *vd, boolean_t strict)
1516 1472  {
1517 1473          spa_t *spa = vd->vdev_spa;
1518 1474          nvlist_t *label;
1519      -        uint64_t guid = 0, aux_guid = 0, top_guid;
     1475 +        uint64_t guid = 0, top_guid;
1520 1476          uint64_t state;
1521      -        nvlist_t *nvl;
1522      -        uint64_t txg;
1523 1477  
1524      -        if (vdev_validate_skip)
1525      -                return (0);
1526      -
1527      -        for (uint64_t c = 0; c < vd->vdev_children; c++)
1528      -                if (vdev_validate(vd->vdev_child[c]) != 0)
     1478 +        for (int c = 0; c < vd->vdev_children; c++)
     1479 +                if (vdev_validate(vd->vdev_child[c], strict) != 0)
1529 1480                          return (SET_ERROR(EBADF));
1530 1481  
1531 1482          /*
1532 1483           * If the device has already failed, or was marked offline, don't do
1533 1484           * any further validation.  Otherwise, label I/O will fail and we will
1534 1485           * overwrite the previous state.
1535 1486           */
1536      -        if (!vd->vdev_ops->vdev_op_leaf || !vdev_readable(vd))
1537      -                return (0);
     1487 +        if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
     1488 +                uint64_t aux_guid = 0;
     1489 +                nvlist_t *nvl;
     1490 +                uint64_t txg = spa_last_synced_txg(spa) != 0 ?
     1491 +                    spa_last_synced_txg(spa) : -1ULL;
1538 1492  
1539      -        /*
1540      -         * If we are performing an extreme rewind, we allow for a label that
1541      -         * was modified at a point after the current txg.
1542      -         */
1543      -        if (spa->spa_extreme_rewind || spa_last_synced_txg(spa) == 0)
1544      -                txg = UINT64_MAX;
1545      -        else
1546      -                txg = spa_last_synced_txg(spa);
     1493 +                if ((label = vdev_label_read_config(vd, txg)) == NULL) {
     1494 +                        vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
     1495 +                            VDEV_AUX_BAD_LABEL);
     1496 +                        return (0);
     1497 +                }
1547 1498  
1548      -        if ((label = vdev_label_read_config(vd, txg)) == NULL) {
1549      -                vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
1550      -                    VDEV_AUX_BAD_LABEL);
1551      -                vdev_dbgmsg(vd, "vdev_validate: failed reading config");
1552      -                return (0);
1553      -        }
     1499 +                /*
     1500 +                 * Determine if this vdev has been split off into another
     1501 +                 * pool.  If so, then refuse to open it.
     1502 +                 */
     1503 +                if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID,
     1504 +                    &aux_guid) == 0 && aux_guid == spa_guid(spa)) {
     1505 +                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
     1506 +                            VDEV_AUX_SPLIT_POOL);
     1507 +                        nvlist_free(label);
     1508 +                        return (0);
     1509 +                }
1554 1510  
1555      -        /*
1556      -         * Determine if this vdev has been split off into another
1557      -         * pool.  If so, then refuse to open it.
1558      -         */
1559      -        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID,
1560      -            &aux_guid) == 0 && aux_guid == spa_guid(spa)) {
1561      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1562      -                    VDEV_AUX_SPLIT_POOL);
1563      -                nvlist_free(label);
1564      -                vdev_dbgmsg(vd, "vdev_validate: vdev split into other pool");
1565      -                return (0);
1566      -        }
     1511 +                if (strict && (nvlist_lookup_uint64(label,
     1512 +                    ZPOOL_CONFIG_POOL_GUID, &guid) != 0 ||
     1513 +                    guid != spa_guid(spa))) {
     1514 +                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
     1515 +                            VDEV_AUX_CORRUPT_DATA);
     1516 +                        nvlist_free(label);
     1517 +                        return (0);
     1518 +                }
1567 1519  
1568      -        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID, &guid) != 0) {
1569      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1570      -                    VDEV_AUX_CORRUPT_DATA);
1571      -                nvlist_free(label);
1572      -                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
1573      -                    ZPOOL_CONFIG_POOL_GUID);
1574      -                return (0);
1575      -        }
     1520 +                if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl)
     1521 +                    != 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID,
     1522 +                    &aux_guid) != 0)
     1523 +                        aux_guid = 0;
1576 1524  
1577      -        /*
1578      -         * If config is not trusted then ignore the spa guid check. This is
1579      -         * necessary because if the machine crashed during a re-guid the new
1580      -         * guid might have been written to all of the vdev labels, but not the
1581      -         * cached config. The check will be performed again once we have the
1582      -         * trusted config from the MOS.
1583      -         */
1584      -        if (spa->spa_trust_config && guid != spa_guid(spa)) {
1585      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1586      -                    VDEV_AUX_CORRUPT_DATA);
1587      -                nvlist_free(label);
1588      -                vdev_dbgmsg(vd, "vdev_validate: vdev label pool_guid doesn't "
1589      -                    "match config (%llu != %llu)", (u_longlong_t)guid,
1590      -                    (u_longlong_t)spa_guid(spa));
1591      -                return (0);
1592      -        }
1593      -
1594      -        if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl)
1595      -            != 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID,
1596      -            &aux_guid) != 0)
1597      -                aux_guid = 0;
1598      -
1599      -        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID, &guid) != 0) {
1600      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1601      -                    VDEV_AUX_CORRUPT_DATA);
1602      -                nvlist_free(label);
1603      -                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
1604      -                    ZPOOL_CONFIG_GUID);
1605      -                return (0);
1606      -        }
1607      -
1608      -        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID, &top_guid)
1609      -            != 0) {
1610      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1611      -                    VDEV_AUX_CORRUPT_DATA);
1612      -                nvlist_free(label);
1613      -                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
1614      -                    ZPOOL_CONFIG_TOP_GUID);
1615      -                return (0);
1616      -        }
1617      -
1618      -        /*
1619      -         * If this vdev just became a top-level vdev because its sibling was
1620      -         * detached, it will have adopted the parent's vdev guid -- but the
1621      -         * label may or may not be on disk yet. Fortunately, either version
1622      -         * of the label will have the same top guid, so if we're a top-level
1623      -         * vdev, we can safely compare to that instead.
1624      -         * However, if the config comes from a cachefile that failed to update
1625      -         * after the detach, a top-level vdev will appear as a non top-level
1626      -         * vdev in the config. Also relax the constraints if we perform an
1627      -         * extreme rewind.
1628      -         *
1629      -         * If we split this vdev off instead, then we also check the
1630      -         * original pool's guid. We don't want to consider the vdev
1631      -         * corrupt if it is partway through a split operation.
1632      -         */
1633      -        if (vd->vdev_guid != guid && vd->vdev_guid != aux_guid) {
1634      -                boolean_t mismatch = B_FALSE;
1635      -                if (spa->spa_trust_config && !spa->spa_extreme_rewind) {
1636      -                        if (vd != vd->vdev_top || vd->vdev_guid != top_guid)
1637      -                                mismatch = B_TRUE;
1638      -                } else {
1639      -                        if (vd->vdev_guid != top_guid &&
1640      -                            vd->vdev_top->vdev_guid != guid)
1641      -                                mismatch = B_TRUE;
     1525 +                /*
     1526 +                 * If this vdev just became a top-level vdev because its
     1527 +                 * sibling was detached, it will have adopted the parent's
     1528 +                 * vdev guid -- but the label may or may not be on disk yet.
     1529 +                 * Fortunately, either version of the label will have the
     1530 +                 * same top guid, so if we're a top-level vdev, we can
     1531 +                 * safely compare to that instead.
     1532 +                 *
     1533 +                 * If we split this vdev off instead, then we also check the
     1534 +                 * original pool's guid.  We don't want to consider the vdev
     1535 +                 * corrupt if it is partway through a split operation.
     1536 +                 */
     1537 +                if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
     1538 +                    &guid) != 0 ||
     1539 +                    nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID,
     1540 +                    &top_guid) != 0 ||
     1541 +                    ((vd->vdev_guid != guid && vd->vdev_guid != aux_guid) &&
     1542 +                    (vd->vdev_guid != top_guid || vd != vd->vdev_top))) {
     1543 +                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
     1544 +                            VDEV_AUX_CORRUPT_DATA);
     1545 +                        nvlist_free(label);
     1546 +                        return (0);
1642 1547                  }
1643 1548  
1644      -                if (mismatch) {
     1549 +                if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
     1550 +                    &state) != 0) {
1645 1551                          vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1646 1552                              VDEV_AUX_CORRUPT_DATA);
1647 1553                          nvlist_free(label);
1648      -                        vdev_dbgmsg(vd, "vdev_validate: config guid "
1649      -                            "doesn't match label guid");
1650      -                        vdev_dbgmsg(vd, "CONFIG: guid %llu, top_guid %llu",
1651      -                            (u_longlong_t)vd->vdev_guid,
1652      -                            (u_longlong_t)vd->vdev_top->vdev_guid);
1653      -                        vdev_dbgmsg(vd, "LABEL: guid %llu, top_guid %llu, "
1654      -                            "aux_guid %llu", (u_longlong_t)guid,
1655      -                            (u_longlong_t)top_guid, (u_longlong_t)aux_guid);
1656 1554                          return (0);
1657 1555                  }
1658      -        }
1659 1556  
1660      -        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
1661      -            &state) != 0) {
1662      -                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
1663      -                    VDEV_AUX_CORRUPT_DATA);
1664 1557                  nvlist_free(label);
1665      -                vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
1666      -                    ZPOOL_CONFIG_POOL_STATE);
1667      -                return (0);
1668      -        }
1669 1558  
1670      -        nvlist_free(label);
     1559 +                /*
     1560 +                 * If this is a verbatim import, no need to check the
     1561 +                 * state of the pool.
     1562 +                 */
     1563 +                if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) &&
     1564 +                    spa_load_state(spa) == SPA_LOAD_OPEN &&
     1565 +                    state != POOL_STATE_ACTIVE)
     1566 +                        return (SET_ERROR(EBADF));
1671 1567  
1672      -        /*
1673      -         * If this is a verbatim import, no need to check the
1674      -         * state of the pool.
1675      -         */
1676      -        if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) &&
1677      -            spa_load_state(spa) == SPA_LOAD_OPEN &&
1678      -            state != POOL_STATE_ACTIVE) {
1679      -                vdev_dbgmsg(vd, "vdev_validate: invalid pool state (%llu) "
1680      -                    "for spa %s", (u_longlong_t)state, spa->spa_name);
1681      -                return (SET_ERROR(EBADF));
     1568 +                /*
     1569 +                 * If we were able to open and validate a vdev that was
     1570 +                 * previously marked permanently unavailable, clear that state
     1571 +                 * now.
     1572 +                 */
     1573 +                if (vd->vdev_not_present)
     1574 +                        vd->vdev_not_present = 0;
1682 1575          }
1683 1576  
1684      -        /*
1685      -         * If we were able to open and validate a vdev that was
1686      -         * previously marked permanently unavailable, clear that state
1687      -         * now.
1688      -         */
1689      -        if (vd->vdev_not_present)
1690      -                vd->vdev_not_present = 0;
1691      -
1692 1577          return (0);
1693 1578  }
1694 1579  
1695      -static void
1696      -vdev_copy_path_impl(vdev_t *svd, vdev_t *dvd)
1697      -{
1698      -        if (svd->vdev_path != NULL && dvd->vdev_path != NULL) {
1699      -                if (strcmp(svd->vdev_path, dvd->vdev_path) != 0) {
1700      -                        zfs_dbgmsg("vdev_copy_path: vdev %llu: path changed "
1701      -                            "from '%s' to '%s'", (u_longlong_t)dvd->vdev_guid,
1702      -                            dvd->vdev_path, svd->vdev_path);
1703      -                        spa_strfree(dvd->vdev_path);
1704      -                        dvd->vdev_path = spa_strdup(svd->vdev_path);
1705      -                }
1706      -        } else if (svd->vdev_path != NULL) {
1707      -                dvd->vdev_path = spa_strdup(svd->vdev_path);
1708      -                zfs_dbgmsg("vdev_copy_path: vdev %llu: path set to '%s'",
1709      -                    (u_longlong_t)dvd->vdev_guid, dvd->vdev_path);
1710      -        }
1711      -}
1712      -
1713 1580  /*
1714      - * Recursively copy vdev paths from one vdev to another. Source and destination
1715      - * vdev trees must have the same geometry; otherwise an error is returned. Intended to copy
1716      - * paths from userland config into MOS config.
1717      - */
1718      -int
1719      -vdev_copy_path_strict(vdev_t *svd, vdev_t *dvd)
1720      -{
1721      -        if ((svd->vdev_ops == &vdev_missing_ops) ||
1722      -            (svd->vdev_ishole && dvd->vdev_ishole) ||
1723      -            (dvd->vdev_ops == &vdev_indirect_ops))
1724      -                return (0);
1725      -
1726      -        if (svd->vdev_ops != dvd->vdev_ops) {
1727      -                vdev_dbgmsg(svd, "vdev_copy_path: vdev type mismatch: %s != %s",
1728      -                    svd->vdev_ops->vdev_op_type, dvd->vdev_ops->vdev_op_type);
1729      -                return (SET_ERROR(EINVAL));
1730      -        }
1731      -
1732      -        if (svd->vdev_guid != dvd->vdev_guid) {
1733      -                vdev_dbgmsg(svd, "vdev_copy_path: guids mismatch (%llu != "
1734      -                    "%llu)", (u_longlong_t)svd->vdev_guid,
1735      -                    (u_longlong_t)dvd->vdev_guid);
1736      -                return (SET_ERROR(EINVAL));
1737      -        }
1738      -
1739      -        if (svd->vdev_children != dvd->vdev_children) {
1740      -                vdev_dbgmsg(svd, "vdev_copy_path: children count mismatch: "
1741      -                    "%llu != %llu", (u_longlong_t)svd->vdev_children,
1742      -                    (u_longlong_t)dvd->vdev_children);
1743      -                return (SET_ERROR(EINVAL));
1744      -        }
1745      -
1746      -        for (uint64_t i = 0; i < svd->vdev_children; i++) {
1747      -                int error = vdev_copy_path_strict(svd->vdev_child[i],
1748      -                    dvd->vdev_child[i]);
1749      -                if (error != 0)
1750      -                        return (error);
1751      -        }
1752      -
1753      -        if (svd->vdev_ops->vdev_op_leaf)
1754      -                vdev_copy_path_impl(svd, dvd);
1755      -
1756      -        return (0);
1757      -}
1758      -
1759      -static void
1760      -vdev_copy_path_search(vdev_t *stvd, vdev_t *dvd)
1761      -{
1762      -        ASSERT(stvd->vdev_top == stvd);
1763      -        ASSERT3U(stvd->vdev_id, ==, dvd->vdev_top->vdev_id);
1764      -
1765      -        for (uint64_t i = 0; i < dvd->vdev_children; i++) {
1766      -                vdev_copy_path_search(stvd, dvd->vdev_child[i]);
1767      -        }
1768      -
1769      -        if (!dvd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(dvd))
1770      -                return;
1771      -
1772      -        /*
1773      -         * The idea here is that while a vdev can shift positions within
1775      -         * a top vdev (when replacing, attaching mirror, etc.), it cannot
1775      -         * step outside of it.
1776      -         */
1777      -        vdev_t *vd = vdev_lookup_by_guid(stvd, dvd->vdev_guid);
1778      -
1779      -        if (vd == NULL || vd->vdev_ops != dvd->vdev_ops)
1780      -                return;
1781      -
1782      -        ASSERT(vd->vdev_ops->vdev_op_leaf);
1783      -
1784      -        vdev_copy_path_impl(vd, dvd);
1785      -}
1786      -
1787      -/*
1788      - * Recursively copy vdev paths from one root vdev to another. Source and
1789      - * destination vdev trees may differ in geometry. For each destination leaf
1790      - * vdev, search for a vdev with the same guid and top vdev id in the source.
1791      - * Intended to copy paths from userland config into MOS config.
1792      - */
1793      -void
1794      -vdev_copy_path_relaxed(vdev_t *srvd, vdev_t *drvd)
1795      -{
1796      -        uint64_t children = MIN(srvd->vdev_children, drvd->vdev_children);
1797      -        ASSERT(srvd->vdev_ops == &vdev_root_ops);
1798      -        ASSERT(drvd->vdev_ops == &vdev_root_ops);
1799      -
1800      -        for (uint64_t i = 0; i < children; i++) {
1801      -                vdev_copy_path_search(srvd->vdev_child[i],
1802      -                    drvd->vdev_child[i]);
1803      -        }
1804      -}
1805      -
1806      -/*
1807 1581   * Close a virtual device.
1808 1582   */
1809 1583  void
1810 1584  vdev_close(vdev_t *vd)
1811 1585  {
1812 1586          spa_t *spa = vd->vdev_spa;
1813 1587          vdev_t *pvd = vd->vdev_parent;
1814 1588  
1815 1589          ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
1816 1590  
[ 71 lines elided ]
1888 1662  
1889 1663          /*
1890 1664           * Call vdev_validate() here to make sure we have the same device.
1891 1665           * Otherwise, a device with an invalid label could be successfully
1892 1666           * opened in response to vdev_reopen().
1893 1667           */
1894 1668          if (vd->vdev_aux) {
1895 1669                  (void) vdev_validate_aux(vd);
1896 1670                  if (vdev_readable(vd) && vdev_writeable(vd) &&
1897 1671                      vd->vdev_aux == &spa->spa_l2cache &&
1898      -                    !l2arc_vdev_present(vd))
1899      -                        l2arc_add_vdev(spa, vd);
     1672 +                    !l2arc_vdev_present(vd)) {
     1673 +                        /*
     1674 +                         * When reopening we can assume persistent L2ARC is
     1675 +                         * supported, since we've already opened the device
     1676 +                         * in the past and prepended an L2ARC uberblock.
     1677 +                         */
     1678 +                        l2arc_add_vdev(spa, vd, B_TRUE);
     1679 +                }
1900 1680          } else {
1901      -                (void) vdev_validate(vd);
     1681 +                (void) vdev_validate(vd, B_TRUE);
1902 1682          }
1903 1683  
1904 1684          /*
1905 1685           * Reassess parent vdev's health.
1906 1686           */
1907 1687          vdev_propagate_state(vd);
1908 1688  }
1909 1689  
1910 1690  int
1911 1691  vdev_create(vdev_t *vd, uint64_t txg, boolean_t isreplacing)
[ 32 lines elided ]
1944 1724           * Aim for roughly metaslabs_per_vdev (default 200) metaslabs per vdev.
1945 1725           */
1946 1726          vd->vdev_ms_shift = highbit64(vd->vdev_asize / metaslabs_per_vdev);
1947 1727          vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
1948 1728  }
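/*
 * Worked example (editorial; hypothetical numbers): with the default
 * metaslabs_per_vdev of 200 and a 1 TiB vdev,
 *
 *   asize / 200   ~= 5.5 GiB
 *   highbit64()    = 33            (2^32 <= 5.5 GiB < 2^33)
 *   metaslab size  = 2^33 = 8 GiB  ->  1 TiB / 8 GiB = 128 metaslabs
 *
 * The MAX() with SPA_MAXBLOCKSHIFT then guarantees a metaslab is never
 * smaller than the largest block the pool can allocate.
 */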
1949 1729  
1950 1730  void
1951 1731  vdev_dirty(vdev_t *vd, int flags, void *arg, uint64_t txg)
1952 1732  {
1953 1733          ASSERT(vd == vd->vdev_top);
1954      -        /* indirect vdevs don't have metaslabs or dtls */
1955      -        ASSERT(vdev_is_concrete(vd) || flags == 0);
     1734 +        ASSERT(!vd->vdev_ishole);
1956 1735          ASSERT(ISP2(flags));
1957 1736          ASSERT(spa_writeable(vd->vdev_spa));
1958 1737  
1959 1738          if (flags & VDD_METASLAB)
1960 1739                  (void) txg_list_add(&vd->vdev_ms_list, arg, txg);
1961 1740  
1962 1741          if (flags & VDD_DTL)
1963 1742                  (void) txg_list_add(&vd->vdev_dtl_list, arg, txg);
1964 1743  
1965 1744          (void) txg_list_add(&vd->vdev_spa->spa_vdev_txg_list, vd, txg);
[ 49 lines elided ]
2015 1794   */
2016 1795  void
2017 1796  vdev_dtl_dirty(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size)
2018 1797  {
2019 1798          range_tree_t *rt = vd->vdev_dtl[t];
2020 1799  
2021 1800          ASSERT(t < DTL_TYPES);
2022 1801          ASSERT(vd != vd->vdev_spa->spa_root_vdev);
2023 1802          ASSERT(spa_writeable(vd->vdev_spa));
2024 1803  
2025      -        mutex_enter(&vd->vdev_dtl_lock);
     1804 +        mutex_enter(rt->rt_lock);
2026 1805          if (!range_tree_contains(rt, txg, size))
2027 1806                  range_tree_add(rt, txg, size);
2028      -        mutex_exit(&vd->vdev_dtl_lock);
     1807 +        mutex_exit(rt->rt_lock);
2029 1808  }
2030 1809  
2031 1810  boolean_t
2032 1811  vdev_dtl_contains(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size)
2033 1812  {
2034 1813          range_tree_t *rt = vd->vdev_dtl[t];
2035 1814          boolean_t dirty = B_FALSE;
2036 1815  
2037 1816          ASSERT(t < DTL_TYPES);
2038 1817          ASSERT(vd != vd->vdev_spa->spa_root_vdev);
2039 1818  
2040      -        /*
2041      -         * While we are loading the pool, the DTLs have not been loaded yet.
2042      -         * Ignore the DTLs and try all devices.  This avoids a recursive
2043      -         * mutex enter on the vdev_dtl_lock, and also makes us try hard
2044      -         * when loading the pool (relying on the checksum to ensure that
2045      -         * we get the right data -- note that while loading, we are
2046      -         * only reading the MOS, which is always checksummed).
2047      -         */
2048      -        if (vd->vdev_spa->spa_load_state != SPA_LOAD_NONE)
2049      -                return (B_FALSE);
2050      -
2051      -        mutex_enter(&vd->vdev_dtl_lock);
     1819 +        mutex_enter(rt->rt_lock);
2052 1820          if (range_tree_space(rt) != 0)
2053 1821                  dirty = range_tree_contains(rt, txg, size);
2054      -        mutex_exit(&vd->vdev_dtl_lock);
     1822 +        mutex_exit(rt->rt_lock);
2055 1823  
2056 1824          return (dirty);
2057 1825  }
2058 1826  
2059 1827  boolean_t
2060 1828  vdev_dtl_empty(vdev_t *vd, vdev_dtl_type_t t)
2061 1829  {
2062 1830          range_tree_t *rt = vd->vdev_dtl[t];
2063 1831          boolean_t empty;
2064 1832  
2065      -        mutex_enter(&vd->vdev_dtl_lock);
     1833 +        mutex_enter(rt->rt_lock);
2066 1834          empty = (range_tree_space(rt) == 0);
2067      -        mutex_exit(&vd->vdev_dtl_lock);
     1835 +        mutex_exit(rt->rt_lock);
2068 1836  
2069 1837          return (empty);
2070 1838  }
2071 1839  
2072 1840  /*
2073 1841   * Returns the lowest txg in the DTL range.
2074 1842   */
2075 1843  static uint64_t
2076 1844  vdev_dtl_min(vdev_t *vd)
2077 1845  {
[ 72 lines elided ]
2150 1918          spa_t *spa = vd->vdev_spa;
2151 1919          avl_tree_t reftree;
2152 1920          int minref;
2153 1921  
2154 1922          ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0);
2155 1923  
2156 1924          for (int c = 0; c < vd->vdev_children; c++)
2157 1925                  vdev_dtl_reassess(vd->vdev_child[c], txg,
2158 1926                      scrub_txg, scrub_done);
2159 1927  
2160      -        if (vd == spa->spa_root_vdev || !vdev_is_concrete(vd) || vd->vdev_aux)
     1928 +        if (vd == spa->spa_root_vdev || vd->vdev_ishole || vd->vdev_aux)
2161 1929                  return;
2162 1930  
2163 1931          if (vd->vdev_ops->vdev_op_leaf) {
2164 1932                  dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
2165 1933  
2166 1934                  mutex_enter(&vd->vdev_dtl_lock);
2167 1935  
2168 1936                  /*
2169 1937                   * If we've completed a scan cleanly then determine
2170 1938                   * if this vdev should remove any DTLs. We only want to
[ 85 lines elided ]
2256 2024  }
2257 2025  
2258 2026  int
2259 2027  vdev_dtl_load(vdev_t *vd)
2260 2028  {
2261 2029          spa_t *spa = vd->vdev_spa;
2262 2030          objset_t *mos = spa->spa_meta_objset;
2263 2031          int error = 0;
2264 2032  
2265 2033          if (vd->vdev_ops->vdev_op_leaf && vd->vdev_dtl_object != 0) {
2266      -                ASSERT(vdev_is_concrete(vd));
     2034 +                ASSERT(!vd->vdev_ishole);
2267 2035  
2268 2036                  error = space_map_open(&vd->vdev_dtl_sm, mos,
2269      -                    vd->vdev_dtl_object, 0, -1ULL, 0);
     2037 +                    vd->vdev_dtl_object, 0, -1ULL, 0, &vd->vdev_dtl_lock);
2270 2038                  if (error)
2271 2039                          return (error);
2272 2040                  ASSERT(vd->vdev_dtl_sm != NULL);
2273 2041  
2274 2042                  mutex_enter(&vd->vdev_dtl_lock);
2275 2043  
2276 2044                  /*
2277 2045                   * Now that we've opened the space_map we need to update
2278 2046                   * the in-core DTL.
2279 2047                   */
[ 58 lines elided ]
2338 2106          }
2339 2107  }
2340 2108  
2341 2109  void
2342 2110  vdev_dtl_sync(vdev_t *vd, uint64_t txg)
2343 2111  {
2344 2112          spa_t *spa = vd->vdev_spa;
2345 2113          range_tree_t *rt = vd->vdev_dtl[DTL_MISSING];
2346 2114          objset_t *mos = spa->spa_meta_objset;
2347 2115          range_tree_t *rtsync;
     2116 +        kmutex_t rtlock;
2348 2117          dmu_tx_t *tx;
2349 2118          uint64_t object = space_map_object(vd->vdev_dtl_sm);
2350 2119  
2351      -        ASSERT(vdev_is_concrete(vd));
     2120 +        ASSERT(!vd->vdev_ishole);
2352 2121          ASSERT(vd->vdev_ops->vdev_op_leaf);
2353 2122  
2354 2123          tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
2355 2124  
2356 2125          if (vd->vdev_detached || vd->vdev_top->vdev_removing) {
2357 2126                  mutex_enter(&vd->vdev_dtl_lock);
2358 2127                  space_map_free(vd->vdev_dtl_sm, tx);
2359 2128                  space_map_close(vd->vdev_dtl_sm);
2360 2129                  vd->vdev_dtl_sm = NULL;
2361 2130                  mutex_exit(&vd->vdev_dtl_lock);
2362 2131  
2363 2132                  /*
2364 2133                   * We only destroy the leaf ZAP for detached leaves or for
2365 2134                   * removed log devices. Removed data devices handle leaf ZAP
2366 2135                   * cleanup later, once cancellation is no longer possible.
2367 2136                   */
2368 2137                  if (vd->vdev_leaf_zap != 0 && (vd->vdev_detached ||
2369      -                    vd->vdev_top->vdev_islog)) {
     2138 +                    vd->vdev_top->vdev_islog || vd->vdev_top->vdev_isspecial)) {
2370 2139                          vdev_destroy_unlink_zap(vd, vd->vdev_leaf_zap, tx);
2371 2140                          vd->vdev_leaf_zap = 0;
2372 2141                  }
2373 2142  
2374 2143                  dmu_tx_commit(tx);
2375 2144                  return;
2376 2145          }
2377 2146  
2378 2147          if (vd->vdev_dtl_sm == NULL) {
2379 2148                  uint64_t new_object;
2380 2149  
2381 2150                  new_object = space_map_alloc(mos, tx);
2382 2151                  VERIFY3U(new_object, !=, 0);
2383 2152  
2384 2153                  VERIFY0(space_map_open(&vd->vdev_dtl_sm, mos, new_object,
2385      -                    0, -1ULL, 0));
     2154 +                    0, -1ULL, 0, &vd->vdev_dtl_lock));
2386 2155                  ASSERT(vd->vdev_dtl_sm != NULL);
2387 2156          }
2388 2157  
2389      -        rtsync = range_tree_create(NULL, NULL);
     2158 +        mutex_init(&rtlock, NULL, MUTEX_DEFAULT, NULL);
2390 2159  
     2160 +        rtsync = range_tree_create(NULL, NULL, &rtlock);
     2161 +
     2162 +        mutex_enter(&rtlock);
     2163 +
2391 2164          mutex_enter(&vd->vdev_dtl_lock);
2392 2165          range_tree_walk(rt, range_tree_add, rtsync);
2393 2166          mutex_exit(&vd->vdev_dtl_lock);
2394 2167  
2395 2168          space_map_truncate(vd->vdev_dtl_sm, tx);
2396 2169          space_map_write(vd->vdev_dtl_sm, rtsync, SM_ALLOC, tx);
2397 2170          range_tree_vacate(rtsync, NULL, NULL);
2398 2171  
2399 2172          range_tree_destroy(rtsync);
2400 2173  
     2174 +        mutex_exit(&rtlock);
     2175 +        mutex_destroy(&rtlock);
     2176 +
2401 2177          /*
2402 2178           * If the object for the space map has changed then dirty
2403 2179           * the top level so that we update the config.
2404 2180           */
2405 2181          if (object != space_map_object(vd->vdev_dtl_sm)) {
2406      -                vdev_dbgmsg(vd, "txg %llu, spa %s, DTL old object %llu, "
2407      -                    "new object %llu", (u_longlong_t)txg, spa_name(spa),
2408      -                    (u_longlong_t)object,
2409      -                    (u_longlong_t)space_map_object(vd->vdev_dtl_sm));
     2182 +                zfs_dbgmsg("txg %llu, spa %s, DTL old object %llu, "
     2183 +                    "new object %llu", txg, spa_name(spa), object,
     2184 +                    space_map_object(vd->vdev_dtl_sm));
2410 2185                  vdev_config_dirty(vd->vdev_top);
2411 2186          }
2412 2187  
2413 2188          dmu_tx_commit(tx);
2414 2189  
2415 2190          mutex_enter(&vd->vdev_dtl_lock);
2416 2191          space_map_update(vd->vdev_dtl_sm);
2417 2192          mutex_exit(&vd->vdev_dtl_lock);
2418 2193  }
2419 2194  
[ 64 lines elided ]
2484 2259                  }
2485 2260          }
2486 2261  
2487 2262          if (needed && minp) {
2488 2263                  *minp = thismin;
2489 2264                  *maxp = thismax;
2490 2265          }
2491 2266          return (needed);
2492 2267  }
2493 2268  
2494      -int
     2269 +void
2495 2270  vdev_load(vdev_t *vd)
2496 2271  {
2497      -        int error = 0;
2498 2272          /*
2499 2273           * Recursively load all children.
2500 2274           */
2501      -        for (int c = 0; c < vd->vdev_children; c++) {
2502      -                error = vdev_load(vd->vdev_child[c]);
2503      -                if (error != 0) {
2504      -                        return (error);
2505      -                }
2506      -        }
     2275 +        for (int c = 0; c < vd->vdev_children; c++)
     2276 +                vdev_load(vd->vdev_child[c]);
2507 2277  
2508      -        vdev_set_deflate_ratio(vd);
2509      -
2510 2278          /*
2511 2279           * If this is a top-level vdev, initialize its metaslabs.
2512 2280           */
2513      -        if (vd == vd->vdev_top && vdev_is_concrete(vd)) {
2514      -                if (vd->vdev_ashift == 0 || vd->vdev_asize == 0) {
2515      -                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
2516      -                            VDEV_AUX_CORRUPT_DATA);
2517      -                        vdev_dbgmsg(vd, "vdev_load: invalid size. ashift=%llu, "
2518      -                            "asize=%llu", (u_longlong_t)vd->vdev_ashift,
2519      -                            (u_longlong_t)vd->vdev_asize);
2520      -                        return (SET_ERROR(ENXIO));
2521      -                } else if ((error = vdev_metaslab_init(vd, 0)) != 0) {
2522      -                        vdev_dbgmsg(vd, "vdev_load: metaslab_init failed "
2523      -                            "[error=%d]", error);
2524      -                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
2525      -                            VDEV_AUX_CORRUPT_DATA);
2526      -                        return (error);
2527      -                }
2528      -        }
     2281 +        if (vd == vd->vdev_top && !vd->vdev_ishole &&
     2282 +            (vd->vdev_ashift == 0 || vd->vdev_asize == 0 ||
     2283 +            vdev_metaslab_init(vd, 0) != 0))
     2284 +                vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
     2285 +                    VDEV_AUX_CORRUPT_DATA);
2529 2286  
2530 2287          /*
2531 2288           * If this is a leaf vdev, load its DTL.
2532 2289           */
2533      -        if (vd->vdev_ops->vdev_op_leaf && (error = vdev_dtl_load(vd)) != 0) {
     2290 +        if (vd->vdev_ops->vdev_op_leaf && vdev_dtl_load(vd) != 0)
2534 2291                  vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
2535 2292                      VDEV_AUX_CORRUPT_DATA);
2536      -                vdev_dbgmsg(vd, "vdev_load: vdev_dtl_load failed "
2537      -                    "[error=%d]", error);
2538      -                return (error);
2539      -        }
2540      -
2541      -        uint64_t obsolete_sm_object = vdev_obsolete_sm_object(vd);
2542      -        if (obsolete_sm_object != 0) {
2543      -                objset_t *mos = vd->vdev_spa->spa_meta_objset;
2544      -                ASSERT(vd->vdev_asize != 0);
2545      -                ASSERT(vd->vdev_obsolete_sm == NULL);
2546      -
2547      -                if ((error = space_map_open(&vd->vdev_obsolete_sm, mos,
2548      -                    obsolete_sm_object, 0, vd->vdev_asize, 0))) {
2549      -                        vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
2550      -                            VDEV_AUX_CORRUPT_DATA);
2551      -                        vdev_dbgmsg(vd, "vdev_load: space_map_open failed for "
2552      -                            "obsolete spacemap (obj %llu) [error=%d]",
2553      -                            (u_longlong_t)obsolete_sm_object, error);
2554      -                        return (error);
2555      -                }
2556      -                space_map_update(vd->vdev_obsolete_sm);
2557      -        }
2558      -
2559      -        return (0);
2560 2293  }
2561 2294  
2562 2295  /*
2563 2296   * The special vdev case is used for hot spares and l2cache devices.  Its
2564 2297   * sole purpose is to set the vdev state for the associated vdev.  To do this,
2565 2298   * we make sure that we can open the underlying device, then try to read the
2566 2299   * label, and make sure that the label is sane and that it hasn't been
2567 2300   * repurposed to another pool.
2568 2301   */
2569 2302  int
[ 24 lines elided ]
2594 2327          }
2595 2328  
2596 2329          /*
2597 2330           * We don't actually check the pool state here.  If it's in fact in
2598 2331           * use by another pool, we update this fact on the fly when requested.
2599 2332           */
2600 2333          nvlist_free(label);
2601 2334          return (0);
2602 2335  }
2603 2336  
2604      -/*
2605      - * Free the objects used to store this vdev's spacemaps, and the array
2606      - * that points to them.
2607      - */
2608 2337  void
2609      -vdev_destroy_spacemaps(vdev_t *vd, dmu_tx_t *tx)
     2338 +vdev_remove(vdev_t *vd, uint64_t txg)
2610 2339  {
2611      -        if (vd->vdev_ms_array == 0)
2612      -                return;
2613      -
2614      -        objset_t *mos = vd->vdev_spa->spa_meta_objset;
2615      -        uint64_t array_count = vd->vdev_asize >> vd->vdev_ms_shift;
2616      -        size_t array_bytes = array_count * sizeof (uint64_t);
2617      -        uint64_t *smobj_array = kmem_alloc(array_bytes, KM_SLEEP);
2618      -        VERIFY0(dmu_read(mos, vd->vdev_ms_array, 0,
2619      -            array_bytes, smobj_array, 0));
2620      -
2621      -        for (uint64_t i = 0; i < array_count; i++) {
2622      -                uint64_t smobj = smobj_array[i];
2623      -                if (smobj == 0)
2624      -                        continue;
2625      -
2626      -                space_map_free_obj(mos, smobj, tx);
2627      -        }
2628      -
2629      -        kmem_free(smobj_array, array_bytes);
2630      -        VERIFY0(dmu_object_free(mos, vd->vdev_ms_array, tx));
2631      -        vd->vdev_ms_array = 0;
2632      -}
2633      -
2634      -static void
2635      -vdev_remove_empty(vdev_t *vd, uint64_t txg)
2636      -{
2637 2340          spa_t *spa = vd->vdev_spa;
     2341 +        objset_t *mos = spa->spa_meta_objset;
2638 2342          dmu_tx_t *tx;
2639 2343  
     2344 +        tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
2640 2345          ASSERT(vd == vd->vdev_top);
2641 2346          ASSERT3U(txg, ==, spa_syncing_txg(spa));
2642 2347  
2643 2348          if (vd->vdev_ms != NULL) {
2644 2349                  metaslab_group_t *mg = vd->vdev_mg;
2645 2350  
2646 2351                  metaslab_group_histogram_verify(mg);
2647 2352                  metaslab_class_histogram_verify(mg->mg_class);
2648 2353  
2649 2354                  for (int m = 0; m < vd->vdev_ms_count; m++) {
[ 6 lines elided ]
2656 2361                          /*
2657 2362                           * If the metaslab was not loaded when the vdev
2658 2363                           * was removed then the histogram accounting may
2659 2364                           * not be accurate. Update the histogram information
2660 2365                           * here so that we ensure that the metaslab group
2661 2366                           * and metaslab class are up-to-date.
2662 2367                           */
2663 2368                          metaslab_group_histogram_remove(mg, msp);
2664 2369  
2665 2370                          VERIFY0(space_map_allocated(msp->ms_sm));
     2371 +                        space_map_free(msp->ms_sm, tx);
2666 2372                          space_map_close(msp->ms_sm);
2667 2373                          msp->ms_sm = NULL;
2668 2374                          mutex_exit(&msp->ms_lock);
2669 2375                  }
2670 2376  
2671 2377                  metaslab_group_histogram_verify(mg);
2672 2378                  metaslab_class_histogram_verify(mg->mg_class);
2673 2379                  for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
2674 2380                          ASSERT0(mg->mg_histogram[i]);
     2381 +
2675 2382          }
2676 2383  
2677      -        tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
2678      -        vdev_destroy_spacemaps(vd, tx);
     2384 +        if (vd->vdev_ms_array) {
     2385 +                (void) dmu_object_free(mos, vd->vdev_ms_array, tx);
     2386 +                vd->vdev_ms_array = 0;
     2387 +        }
2679 2388  
2680      -        if (vd->vdev_islog && vd->vdev_top_zap != 0) {
     2389 +        if ((vd->vdev_islog || vd->vdev_isspecial) &&
     2390 +            vd->vdev_top_zap != 0) {
2681 2391                  vdev_destroy_unlink_zap(vd, vd->vdev_top_zap, tx);
2682 2392                  vd->vdev_top_zap = 0;
2683 2393          }
2684 2394          dmu_tx_commit(tx);
2685 2395  }
2686 2396  
2687 2397  void
2688 2398  vdev_sync_done(vdev_t *vd, uint64_t txg)
2689 2399  {
2690 2400          metaslab_t *msp;
2691 2401          boolean_t reassess = !txg_list_empty(&vd->vdev_ms_list, TXG_CLEAN(txg));
2692 2402  
2693      -        ASSERT(vdev_is_concrete(vd));
     2403 +        ASSERT(!vd->vdev_ishole);
2694 2404  
2695 2405          while (msp = txg_list_remove(&vd->vdev_ms_list, TXG_CLEAN(txg)))
2696 2406                  metaslab_sync_done(msp, txg);
2697 2407  
2698 2408          if (reassess)
2699 2409                  metaslab_sync_reassess(vd->vdev_mg);
2700 2410  }
2701 2411  
2702 2412  void
2703 2413  vdev_sync(vdev_t *vd, uint64_t txg)
2704 2414  {
2705 2415          spa_t *spa = vd->vdev_spa;
2706 2416          vdev_t *lvd;
2707 2417          metaslab_t *msp;
2708 2418          dmu_tx_t *tx;
2709 2419  
2710      -        if (range_tree_space(vd->vdev_obsolete_segments) > 0) {
2711      -                dmu_tx_t *tx;
     2420 +        ASSERT(!vd->vdev_ishole);
2712 2421  
2713      -                ASSERT(vd->vdev_removing ||
2714      -                    vd->vdev_ops == &vdev_indirect_ops);
2715      -
2716      -                tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
2717      -                vdev_indirect_sync_obsolete(vd, tx);
2718      -                dmu_tx_commit(tx);
2719      -
2720      -                /*
2721      -                 * If the vdev is indirect, it can't have dirty
2722      -                 * metaslabs or DTLs.
2723      -                 */
2724      -                if (vd->vdev_ops == &vdev_indirect_ops) {
2725      -                        ASSERT(txg_list_empty(&vd->vdev_ms_list, txg));
2726      -                        ASSERT(txg_list_empty(&vd->vdev_dtl_list, txg));
2727      -                        return;
2728      -                }
2729      -        }
2730      -
2731      -        ASSERT(vdev_is_concrete(vd));
2732      -
2733      -        if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0 &&
2734      -            !vd->vdev_removing) {
     2422 +        if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0) {
2735 2423                  ASSERT(vd == vd->vdev_top);
2736      -                ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
2737 2424                  tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
2738 2425                  vd->vdev_ms_array = dmu_object_alloc(spa->spa_meta_objset,
2739 2426                      DMU_OT_OBJECT_ARRAY, 0, DMU_OT_NONE, 0, tx);
2740 2427                  ASSERT(vd->vdev_ms_array != 0);
2741 2428                  vdev_config_dirty(vd);
2742 2429                  dmu_tx_commit(tx);
2743 2430          }
2744 2431  
     2432 +        /*
     2433 +         * Remove the metadata associated with this vdev once it's empty.
     2434 +         */
     2435 +        if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing)
     2436 +                vdev_remove(vd, txg);
     2437 +
2745 2438          while ((msp = txg_list_remove(&vd->vdev_ms_list, txg)) != NULL) {
2746 2439                  metaslab_sync(msp, txg);
2747 2440                  (void) txg_list_add(&vd->vdev_ms_list, msp, TXG_CLEAN(txg));
2748 2441          }
2749 2442  
2750 2443          while ((lvd = txg_list_remove(&vd->vdev_dtl_list, txg)) != NULL)
2751 2444                  vdev_dtl_sync(lvd, txg);
2752 2445  
2753      -        /*
2754      -         * Remove the metadata associated with this vdev once it's empty.
2755      -         * Note that this is typically used for log/cache device removal;
2756      -         * we don't empty toplevel vdevs when removing them.  But if
2757      -         * a toplevel happens to be emptied, this is not harmful.
2758      -         */
2759      -        if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing) {
2760      -                vdev_remove_empty(vd, txg);
2761      -        }
2762      -
2763 2446          (void) txg_list_add(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg));
2764 2447  }
2765 2448  
2766 2449  uint64_t
2767 2450  vdev_psize_to_asize(vdev_t *vd, uint64_t psize)
2768 2451  {
2769 2452          return (vd->vdev_ops->vdev_op_asize(vd, psize));
2770 2453  }
2771 2454  
2772 2455  /*
[ 103 lines elided ]
2876 2559          if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
2877 2560                  return (spa_vdev_state_exit(spa, NULL, ENODEV));
2878 2561  
2879 2562          if (!vd->vdev_ops->vdev_op_leaf)
2880 2563                  return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
2881 2564  
2882 2565          wasoffline = (vd->vdev_offline || vd->vdev_tmpoffline);
2883 2566          oldstate = vd->vdev_state;
2884 2567  
2885 2568          tvd = vd->vdev_top;
2886      -        vd->vdev_offline = B_FALSE;
2887      -        vd->vdev_tmpoffline = B_FALSE;
     2569 +        vd->vdev_offline = 0ULL;
     2570 +        vd->vdev_tmpoffline = 0ULL;
2888 2571          vd->vdev_checkremove = !!(flags & ZFS_ONLINE_CHECKREMOVE);
2889 2572          vd->vdev_forcefault = !!(flags & ZFS_ONLINE_FORCEFAULT);
2890 2573  
2891 2574          /* XXX - L2ARC 1.0 does not support expansion */
2892 2575          if (!vd->vdev_aux) {
2893 2576                  for (pvd = vd; pvd != rvd; pvd = pvd->vdev_parent)
2894 2577                          pvd->vdev_expanding = !!(flags & ZFS_ONLINE_EXPAND);
2895 2578          }
2896 2579  
2897 2580          vdev_reopen(tvd);
[ 68 lines elided ]
2966 2649                   * is not NULL since it's possible that we may have just
2967 2650                   * added this vdev but not yet initialized its metaslabs.
2968 2651                   */
2969 2652                  if (tvd->vdev_islog && mg != NULL) {
2970 2653                          /*
2971 2654                           * Prevent any future allocations.
2972 2655                           */
2973 2656                          metaslab_group_passivate(mg);
2974 2657                          (void) spa_vdev_state_exit(spa, vd, 0);
2975 2658  
2976      -                        error = spa_reset_logs(spa);
     2659 +                        error = spa_offline_log(spa);
2977 2660  
2978 2661                          spa_vdev_state_enter(spa, SCL_ALLOC);
2979 2662  
2980 2663                          /*
2981 2664                           * Check to see if the config has changed.
2982 2665                           */
2983 2666                          if (error || generation != spa->spa_config_generation) {
2984 2667                                  metaslab_group_activate(mg);
2985 2668                                  if (error)
2986 2669                                          return (spa_vdev_state_exit(spa,
[ 46 lines elided ]
3033 2716  }
3034 2717  
3035 2718  /*
3036 2719   * Clear the error counts associated with this vdev.  Unlike vdev_online() and
3037 2720   * vdev_offline(), we assume the spa config is locked.  We also clear all
3038 2721   * children.  If 'vd' is NULL, then the user wants to clear all vdevs.
3039 2722   */
3040 2723  void
3041 2724  vdev_clear(spa_t *spa, vdev_t *vd)
3042 2725  {
     2726 +        int c;
3043 2727          vdev_t *rvd = spa->spa_root_vdev;
3044 2728  
3045 2729          ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
3046 2730  
3047      -        if (vd == NULL)
     2731 +        if (vd == NULL) {
3048 2732                  vd = rvd;
3049 2733  
     2734 +                /* Go through spare and l2cache vdevs */
     2735 +                for (c = 0; c < spa->spa_spares.sav_count; c++)
     2736 +                        vdev_clear(spa, spa->spa_spares.sav_vdevs[c]);
     2737 +                for (c = 0; c < spa->spa_l2cache.sav_count; c++)
     2738 +                        vdev_clear(spa, spa->spa_l2cache.sav_vdevs[c]);
     2739 +        }
     2740 +
3050 2741          vd->vdev_stat.vs_read_errors = 0;
3051 2742          vd->vdev_stat.vs_write_errors = 0;
3052 2743          vd->vdev_stat.vs_checksum_errors = 0;
3053 2744  
3054      -        for (int c = 0; c < vd->vdev_children; c++)
3055      -                vdev_clear(spa, vd->vdev_child[c]);
3056      -
3057 2745          /*
3058      -         * It makes no sense to "clear" an indirect vdev.
     2746 +         * If all disk vdevs failed at the same time (e.g. due to a
     2747 +         * disconnected cable), that suspends I/O activity to the pool,
     2748 +         * which stalls spa_sync if there happened to be any dirty data.
     2749 +         * As a consequence, the vdev_remove_wanted flag might not be
     2750 +         * cleared, since it is only lowered by spa_async_remove, which
     2751 +         * cannot run while the pool is suspended. That in turn keeps
     2752 +         * zio_resume from succeeding even if the vdevs reopen fine,
     2753 +         * leaving the pool suspended indefinitely. So we lower the
     2754 +         * flag here, allowing zio_resume to succeed once the vdevs reopen.
3059 2755           */
3060      -        if (!vdev_is_concrete(vd))
3061      -                return;
     2756 +        vd->vdev_remove_wanted = B_FALSE;
3062 2757  
     2758 +        for (c = 0; c < vd->vdev_children; c++)
     2759 +                vdev_clear(spa, vd->vdev_child[c]);
     2760 +
3063 2761          /*
3064 2762           * If we're in the FAULTED state or have experienced failed I/O, then
3065 2763           * clear the persistent state and attempt to reopen the device.  We
3066 2764           * also mark the vdev config dirty, so that the new faulted state is
3067 2765           * written out to disk.
3068 2766           */
3069 2767          if (vd->vdev_faulted || vd->vdev_degraded ||
3070 2768              !vdev_readable(vd) || !vdev_writeable(vd)) {
3071 2769  
3072 2770                  /*
[ 34 lines elided ]
3107 2805  boolean_t
3108 2806  vdev_is_dead(vdev_t *vd)
3109 2807  {
3110 2808          /*
3111 2809           * Holes and missing devices are always considered "dead".
3112 2810           * This simplifies the code since we don't have to check for
3113 2811           * these types of devices in the various code paths.
3114 2812           * Instead we rely on the fact that we skip over dead devices
3115 2813           * before issuing I/O to them.
3116 2814           */
3117      -        return (vd->vdev_state < VDEV_STATE_DEGRADED ||
3118      -            vd->vdev_ops == &vdev_hole_ops ||
     2815 +        return (vd->vdev_state < VDEV_STATE_DEGRADED || vd->vdev_ishole ||
3119 2816              vd->vdev_ops == &vdev_missing_ops);
3120 2817  }
3121 2818  
3122 2819  boolean_t
3123 2820  vdev_readable(vdev_t *vd)
3124 2821  {
3125      -        return (!vdev_is_dead(vd) && !vd->vdev_cant_read);
     2822 +        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read);
3126 2823  }
3127 2824  
3128 2825  boolean_t
3129 2826  vdev_writeable(vdev_t *vd)
3130 2827  {
3131      -        return (!vdev_is_dead(vd) && !vd->vdev_cant_write &&
3132      -            vdev_is_concrete(vd));
     2828 +        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write);
3133 2829  }
3134 2830  
3135 2831  boolean_t
3136 2832  vdev_allocatable(vdev_t *vd)
3137 2833  {
3138 2834          uint64_t state = vd->vdev_state;
3139 2835  
3140 2836          /*
3141 2837           * We currently allow allocations from vdevs which may be in the
3142 2838           * process of reopening (i.e. VDEV_STATE_CLOSED). If the device
3143 2839           * fails to reopen then we'll catch it later when we're holding
3144 2840           * the proper locks.  Note that we have to get the vdev state
3145 2841           * in a local variable because although it changes atomically,
3146 2842           * we're asking two separate questions about it.
3147 2843           */
3148 2844          return (!(state < VDEV_STATE_DEGRADED && state != VDEV_STATE_CLOSED) &&
3149      -            !vd->vdev_cant_write && vdev_is_concrete(vd) &&
     2845 +            !vd->vdev_cant_write && !vd->vdev_ishole &&
3150 2846              vd->vdev_mg->mg_initialized);
3151 2847  }
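/*
 * Editorial illustration of the race the local 'state' copy avoids:
 *
 *   if (vd->vdev_state < VDEV_STATE_DEGRADED &&   // load #1
 *       vd->vdev_state != VDEV_STATE_CLOSED)      // load #2
 *
 * Each load is individually atomic, but the state may change between
 * the two, so they can observe inconsistent values; snapshotting into
 * 'state' asks both questions of a single observation.
 */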
3152 2848  
3153 2849  boolean_t
3154 2850  vdev_accessible(vdev_t *vd, zio_t *zio)
3155 2851  {
3156 2852          ASSERT(zio->io_vd == vd);
3157 2853  
3158 2854          if (vdev_is_dead(vd) || vd->vdev_remove_wanted)
3159 2855                  return (B_FALSE);
[ 28 lines elided ]
3188 2884                  vs->vs_rsize += VDEV_LABEL_START_SIZE + VDEV_LABEL_END_SIZE;
3189 2885          /*
3190 2886           * Report expandable space on top-level, non-auxiliary devices only.
3191 2887           * The expandable space is reported in terms of metaslab sized units
3192 2888           * since that determines how much space the pool can expand.
3193 2889           */
3194 2890          if (vd->vdev_aux == NULL && tvd != NULL) {
3195 2891                  vs->vs_esize = P2ALIGN(vd->vdev_max_asize - vd->vdev_asize -
3196 2892                      spa->spa_bootsize, 1ULL << tvd->vdev_ms_shift);
3197 2893          }
3198      -        if (vd->vdev_aux == NULL && vd == vd->vdev_top &&
3199      -            vdev_is_concrete(vd)) {
     2894 +        if (vd->vdev_aux == NULL && vd == vd->vdev_top && !vd->vdev_ishole) {
3200 2895                  vs->vs_fragmentation = vd->vdev_mg->mg_fragmentation;
3201 2896          }
3202 2897  
3203 2898          /*
3204 2899           * If we're getting stats on the root vdev, aggregate the I/O counts
3205 2900           * over all top-level vdevs (i.e. the direct children of the root).
3206 2901           */
3207 2902          if (vd == rvd) {
3208 2903                  for (int c = 0; c < rvd->vdev_children; c++) {
3209 2904                          vdev_t *cvd = rvd->vdev_child[c];
3210 2905                          vdev_stat_t *cvs = &cvd->vdev_stat;
3211 2906  
3212 2907                          for (int t = 0; t < ZIO_TYPES; t++) {
3213 2908                                  vs->vs_ops[t] += cvs->vs_ops[t];
3214 2909                                  vs->vs_bytes[t] += cvs->vs_bytes[t];
     2910 +                                vs->vs_iotime[t] += cvs->vs_iotime[t];
     2911 +                                vs->vs_latency[t] += cvs->vs_latency[t];
3215 2912                          }
3216 2913                          cvs->vs_scan_removing = cvd->vdev_removing;
3217 2914                  }
3218 2915          }
3219 2916          mutex_exit(&vd->vdev_stat_lock);
3220 2917  }
3221 2918  
3222 2919  void
3223 2920  vdev_clear_stats(vdev_t *vd)
3224 2921  {
[ 72 lines elided ]
3297 2994                                  vs->vs_scan_processed += psize;
3298 2995                          }
3299 2996  
3300 2997                          if (flags & ZIO_FLAG_SELF_HEAL)
3301 2998                                  vs->vs_self_healed += psize;
3302 2999                  }
3303 3000  
3304 3001                  vs->vs_ops[type]++;
3305 3002                  vs->vs_bytes[type] += psize;
3306 3003  
     3004 +                /*
     3005 +                 * While each delta is measured in nanoseconds, we keep the
     3006 +                 * cumulative iotime in microseconds so that it does not
     3007 +                 * overflow on a busy system.
     3008 +                 */
     3009 +                vs->vs_iotime[type] += (zio->io_vd_timestamp) / 1000;
     3010 +
     3011 +                /*
     3012 +                 * Latency is an exponential moving average of iotime deltas
     3013 +                 * with a tunable alpha expressed in tenths of a percent.
     3014 +                 */
     3015 +                vs->vs_latency[type] += ((int64_t)zio->io_vd_timestamp -
     3016 +                    vs->vs_latency[type]) * zfs_vs_latency_alpha / 1000;
     3017 +
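                /*
                 * Editorial illustration (hypothetical numbers): the update
                 * above is the usual EMA recurrence
                 *
                 *   avg += (sample - avg) * alpha / 1000;
                 *
                 * With zfs_vs_latency_alpha == 100 (i.e. 10%), a steady
                 * stream of 1000us samples converges as
                 * 0 -> 100 -> 190 -> 271 -> ..., damping one-off spikes
                 * while still tracking sustained change.
                 */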
3307 3018                  mutex_exit(&vd->vdev_stat_lock);
3308 3019                  return;
3309 3020          }
3310 3021  
3311 3022          if (flags & ZIO_FLAG_SPECULATIVE)
3312 3023                  return;
3313 3024  
3314 3025          /*
3315 3026           * If this is an I/O error that is going to be retried, then ignore the
3316 3027           * error.  Otherwise, the user may interpret B_FAILFAST I/O errors as
[ 16 lines elided ]
3333 3044          if (type == ZIO_TYPE_READ && !vdev_is_dead(vd)) {
3334 3045                  if (zio->io_error == ECKSUM)
3335 3046                          vs->vs_checksum_errors++;
3336 3047                  else
3337 3048                          vs->vs_read_errors++;
3338 3049          }
3339 3050          if (type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
3340 3051                  vs->vs_write_errors++;
3341 3052          mutex_exit(&vd->vdev_stat_lock);
3342 3053  
3343      -        if (spa->spa_load_state == SPA_LOAD_NONE &&
3344      -            type == ZIO_TYPE_WRITE && txg != 0 &&
     3054 +        if ((vd->vdev_isspecial || vd->vdev_isspecial_child) &&
     3055 +            (vs->vs_checksum_errors != 0 || vs->vs_read_errors != 0 ||
     3056 +            vs->vs_write_errors != 0 || !vdev_readable(vd) ||
     3057 +            !vdev_writeable(vd)) && !spa->spa_special_has_errors) {
     3058 +                /* all new writes will be placed on the normal class */
     3059 +                cmn_err(CE_WARN, "New writes to special vdev [%s] "
     3060 +                    "will be stopped", (vd->vdev_path != NULL) ?
     3061 +                    vd->vdev_path : "undefined");
     3062 +                spa->spa_special_has_errors = B_TRUE;
     3063 +        }
     3064 +
     3065 +        if (type == ZIO_TYPE_WRITE && txg != 0 &&
3345 3066              (!(flags & ZIO_FLAG_IO_REPAIR) ||
3346 3067              (flags & ZIO_FLAG_SCAN_THREAD) ||
3347 3068              spa->spa_claiming)) {
3348 3069                  /*
3349 3070                   * This is either a normal write (not a repair), or it's
3350 3071                   * a repair induced by the scrub thread, or it's a repair
3351 3072                   * made by zil_claim() during spa_load() in the first txg.
3352 3073                   * In the normal case, we commit the DTL change in the same
3353 3074                   * txg as the block was born.  In the scrub-induced repair
3354 3075                   * case, we know that scrubs run in first-pass syncing context,
[ 54 lines elided ]
3409 3130          ASSERT(vd->vdev_deflate_ratio != 0 || vd->vdev_isl2cache);
3410 3131          dspace_delta = (dspace_delta >> SPA_MINBLOCKSHIFT) *
3411 3132              vd->vdev_deflate_ratio;
3412 3133  
3413 3134          mutex_enter(&vd->vdev_stat_lock);
3414 3135          vd->vdev_stat.vs_alloc += alloc_delta;
3415 3136          vd->vdev_stat.vs_space += space_delta;
3416 3137          vd->vdev_stat.vs_dspace += dspace_delta;
3417 3138          mutex_exit(&vd->vdev_stat_lock);
3418 3139  
3419      -        if (mc == spa_normal_class(spa)) {
     3140 +        if (mc == spa_normal_class(spa) || mc == spa_special_class(spa)) {
3420 3141                  mutex_enter(&rvd->vdev_stat_lock);
3421 3142                  rvd->vdev_stat.vs_alloc += alloc_delta;
3422 3143                  rvd->vdev_stat.vs_space += space_delta;
3423 3144                  rvd->vdev_stat.vs_dspace += dspace_delta;
3424 3145                  mutex_exit(&rvd->vdev_stat_lock);
3425 3146          }
3426 3147  
3427 3148          if (mc != NULL) {
3428 3149                  ASSERT(rvd == vd->vdev_parent);
3429 3150                  ASSERT(vd->vdev_ms_count != 0);
[ 69 lines elided ]
3499 3220              (dsl_pool_sync_context(spa_get_dsl(spa)) &&
3500 3221              spa_config_held(spa, SCL_CONFIG, RW_READER)));
3501 3222  
3502 3223          if (vd == rvd) {
3503 3224                  for (c = 0; c < rvd->vdev_children; c++)
3504 3225                          vdev_config_dirty(rvd->vdev_child[c]);
3505 3226          } else {
3506 3227                  ASSERT(vd == vd->vdev_top);
3507 3228  
3508 3229                  if (!list_link_active(&vd->vdev_config_dirty_node) &&
3509      -                    vdev_is_concrete(vd)) {
     3230 +                    !vd->vdev_ishole)
3510 3231                          list_insert_head(&spa->spa_config_dirty_list, vd);
3511      -                }
3512 3232          }
3513 3233  }
3514 3234  
3515 3235  void
3516 3236  vdev_config_clean(vdev_t *vd)
3517 3237  {
3518 3238          spa_t *spa = vd->vdev_spa;
3519 3239  
3520 3240          ASSERT(spa_config_held(spa, SCL_CONFIG, RW_WRITER) ||
3521 3241              (dsl_pool_sync_context(spa_get_dsl(spa)) &&
[ 20 lines elided ]
3542 3262          /*
3543 3263           * The state list is protected by the SCL_STATE lock.  The caller
3544 3264           * must either hold SCL_STATE as writer, or must be the sync thread
3545 3265           * (which holds SCL_STATE as reader).  There's only one sync thread,
3546 3266           * so this is sufficient to ensure mutual exclusion.
3547 3267           */
3548 3268          ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) ||
3549 3269              (dsl_pool_sync_context(spa_get_dsl(spa)) &&
3550 3270              spa_config_held(spa, SCL_STATE, RW_READER)));
3551 3271  
3552      -        if (!list_link_active(&vd->vdev_state_dirty_node) &&
3553      -            vdev_is_concrete(vd))
     3272 +        if (!list_link_active(&vd->vdev_state_dirty_node) && !vd->vdev_ishole)
3554 3273                  list_insert_head(&spa->spa_state_dirty_list, vd);
3555 3274  }
3556 3275  
3557 3276  void
3558 3277  vdev_state_clean(vdev_t *vd)
3559 3278  {
3560 3279          spa_t *spa = vd->vdev_spa;
3561 3280  
3562 3281          ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) ||
3563 3282              (dsl_pool_sync_context(spa_get_dsl(spa)) &&
[ 13 lines elided ]
3577 3296          vdev_t *rvd = spa->spa_root_vdev;
3578 3297          int degraded = 0, faulted = 0;
3579 3298          int corrupted = 0;
3580 3299          vdev_t *child;
3581 3300  
3582 3301          if (vd->vdev_children > 0) {
3583 3302                  for (int c = 0; c < vd->vdev_children; c++) {
3584 3303                          child = vd->vdev_child[c];
3585 3304  
3586 3305                          /*
3587      -                         * Don't factor holes or indirect vdevs into the
3588      -                         * decision.
     3306 +                         * Don't factor holes into the decision.
3589 3307                           */
3590      -                        if (!vdev_is_concrete(child))
     3308 +                        if (child->vdev_ishole)
3591 3309                                  continue;
3592 3310  
3593 3311                          if (!vdev_readable(child) ||
3594 3312                              (!vdev_writeable(child) && spa_writeable(spa))) {
3595 3313                                  /*
3596 3314                                   * Root special: if there is a top-level log
3597 3315                                   * device, treat the root vdev as if it were
3598 3316                                   * degraded.
3599 3317                                   */
3600 3318                                  if (child->vdev_islog && vd == rvd)
[ 154 lines elided ]
3755 3473                  /* Erase any notion of persistent removed state */
3756 3474                  vd->vdev_removed = B_FALSE;
3757 3475          } else {
3758 3476                  vd->vdev_removed = B_FALSE;
3759 3477          }
3760 3478  
3761 3479          if (!isopen && vd->vdev_parent)
3762 3480                  vdev_propagate_state(vd->vdev_parent);
3763 3481  }
3764 3482  
3765      -boolean_t
3766      -vdev_children_are_offline(vdev_t *vd)
3767      -{
3768      -        ASSERT(!vd->vdev_ops->vdev_op_leaf);
3769      -
3770      -        for (uint64_t i = 0; i < vd->vdev_children; i++) {
3771      -                if (vd->vdev_child[i]->vdev_state != VDEV_STATE_OFFLINE)
3772      -                        return (B_FALSE);
3773      -        }
3774      -
3775      -        return (B_TRUE);
3776      -}
3777      -
3778 3483  /*
3779 3484   * Check the vdev configuration to ensure that it's capable of supporting
3780 3485   * a root pool. We do not support partial configuration.
3781 3486   * In addition, only a single top-level vdev is allowed.
3782 3487   */
3783 3488  boolean_t
3784 3489  vdev_is_bootable(vdev_t *vd)
3785 3490  {
3786 3491          if (!vd->vdev_ops->vdev_op_leaf) {
3787 3492                  char *vdev_type = vd->vdev_ops->vdev_op_type;
3788 3493  
3789 3494                  if (strcmp(vdev_type, VDEV_TYPE_ROOT) == 0 &&
3790 3495                      vd->vdev_children > 1) {
3791 3496                          return (B_FALSE);
3792      -                } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0 ||
3793      -                    strcmp(vdev_type, VDEV_TYPE_INDIRECT) == 0) {
     3497 +                } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0) {
3794 3498                          return (B_FALSE);
3795 3499                  }
3796 3500          }
3797 3501  
3798 3502          for (int c = 0; c < vd->vdev_children; c++) {
3799 3503                  if (!vdev_is_bootable(vd->vdev_child[c]))
3800 3504                          return (B_FALSE);
3801 3505          }
3802 3506          return (B_TRUE);
3803 3507  }
3804 3508  
3805      -boolean_t
3806      -vdev_is_concrete(vdev_t *vd)
     3509 +/*
     3510 + * Load the state from the original vdev tree (ovd), which
     3511 + * we've retrieved from the MOS config object. If the original
     3512 + * vdev was offline or faulted then we transfer that state to the
     3513 + * device in the current vdev tree (nvd).
     3514 + */
     3515 +void
     3516 +vdev_load_log_state(vdev_t *nvd, vdev_t *ovd)
3807 3517  {
3808      -        vdev_ops_t *ops = vd->vdev_ops;
3809      -        if (ops == &vdev_indirect_ops || ops == &vdev_hole_ops ||
3810      -            ops == &vdev_missing_ops || ops == &vdev_root_ops) {
3811      -                return (B_FALSE);
3812      -        } else {
3813      -                return (B_TRUE);
     3518 +        spa_t *spa = nvd->vdev_spa;
     3519 +
     3520 +        ASSERT(nvd->vdev_top->vdev_islog);
     3521 +        ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
     3522 +        ASSERT3U(nvd->vdev_guid, ==, ovd->vdev_guid);
     3523 +
     3524 +        for (int c = 0; c < nvd->vdev_children; c++)
     3525 +                vdev_load_log_state(nvd->vdev_child[c], ovd->vdev_child[c]);
     3526 +
     3527 +        if (nvd->vdev_ops->vdev_op_leaf) {
     3528 +                /*
     3529 +                 * Restore the persistent vdev state
     3530 +                 */
     3531 +                nvd->vdev_offline = ovd->vdev_offline;
     3532 +                nvd->vdev_faulted = ovd->vdev_faulted;
     3533 +                nvd->vdev_degraded = ovd->vdev_degraded;
     3534 +                nvd->vdev_removed = ovd->vdev_removed;
3814 3535          }
3815 3536  }
3816 3537  
3817 3538  /*
3818 3539   * Determine if a log device has valid content.  If the vdev was
3819 3540   * removed or faulted in the MOS config then we know that
3820 3541   * the content on the log device has already been written to the pool.
3821 3542   */
3822 3543  boolean_t
3823 3544  vdev_log_state_valid(vdev_t *vd)
[ 11 lines elided ]
3835 3556  
3836 3557  /*
3837 3558   * Expand a vdev if possible.
3838 3559   */
3839 3560  void
3840 3561  vdev_expand(vdev_t *vd, uint64_t txg)
3841 3562  {
3842 3563          ASSERT(vd->vdev_top == vd);
3843 3564          ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL);
3844 3565  
3845      -        vdev_set_deflate_ratio(vd);
3846      -
3847      -        if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count &&
3848      -            vdev_is_concrete(vd)) {
     3566 +        if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) {
3849 3567                  VERIFY(vdev_metaslab_init(vd, txg) == 0);
3850 3568                  vdev_config_dirty(vd);
3851 3569          }
3852 3570  }
3853 3571  
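The expansion test is pure arithmetic: vdev_ms_shift is fixed when the vdev is created, so a grown device shows up as (asize >> ms_shift) exceeding the current metaslab count. A worked example with invented numbers, assuming 512 MiB metaslabs (ms_shift = 29) on a device grown from 100 GiB to 150 GiB:

/*
 * Sketch of the expansion test with made-up sizes; the constants are
 * illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t ms_shift = 29;                 /* 512 MiB metaslabs */
        uint64_t ms_count = 200;                /* 100 GiB / 512 MiB */
        uint64_t new_asize = 150ULL << 30;      /* grown to 150 GiB */

        /* Same test vdev_expand() applies before adding metaslabs. */
        if ((new_asize >> ms_shift) > ms_count)
                printf("expand: %llu -> %llu metaslabs\n",
                    (unsigned long long)ms_count,
                    (unsigned long long)(new_asize >> ms_shift));
        return (0);
}

Here 150 GiB >> 29 is 300, so 100 new metaslabs would be initialized and the config dirtied.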
3854 3572  /*
3855 3573   * Split a vdev.
3856 3574   */
3857 3575  void
3858 3576  vdev_split(vdev_t *vd)
[ 30 lines elided ]
3889 3607                          uint64_t delta;
3890 3608  
3891 3609                          /*
3892 3610                           * Look at the head of all the pending queues,
3893 3611                           * if any I/O has been outstanding for longer than
3894 3612                           * the spa_deadman_synctime we panic the system.
3895 3613                           */
3896 3614                          fio = avl_first(&vq->vq_active_tree);
3897 3615                          delta = gethrtime() - fio->io_timestamp;
3898 3616                          if (delta > spa_deadman_synctime(spa)) {
3899      -                                vdev_dbgmsg(vd, "SLOW IO: zio timestamp "
3900      -                                    "%lluns, delta %lluns, last io %lluns",
3901      -                                    fio->io_timestamp, (u_longlong_t)delta,
     3617 +                                zfs_dbgmsg("SLOW IO: zio timestamp %lluns, "
     3618 +                                    "delta %lluns, last io %lluns",
     3619 +                                    fio->io_timestamp, delta,
3902 3620                                      vq->vq_io_complete_ts);
3903 3621                                  fm_panic("I/O to pool '%s' appears to be "
3904 3622                                      "hung.", spa_name(spa));
3905 3623                          }
3906 3624                  }
3907 3625                  mutex_exit(&vq->vq_lock);
3908 3626          }
     3627 +}
     3628 +
     3629 +boolean_t
     3630 +vdev_type_is_ddt(vdev_t *vd)
     3631 +{
     3632 +        uint64_t pool;
     3633 +
     3634 +        if (vd->vdev_l2ad_ddt == 1 &&
     3635 +            zfs_ddt_limit_type == DDT_LIMIT_TO_L2ARC) {
     3636 +                ASSERT(spa_l2cache_exists(vd->vdev_guid, &pool));
     3637 +                ASSERT(vd->vdev_isl2cache);
     3638 +                return (B_TRUE);
     3639 +        }
     3640 +        return (B_FALSE);
     3641 +}
     3642 +
     3643 +/* count leaf vdev(s) under the given vdev */
     3644 +uint_t
     3645 +vdev_count_leaf_vdevs(vdev_t *vd)
     3646 +{
     3647 +        uint_t cnt = 0;
     3648 +
     3649 +        if (vd->vdev_ops->vdev_op_leaf)
     3650 +                return (1);
     3651 +
     3652 +        /* if this is not a leaf vdev - visit children */
     3653 +        for (int c = 0; c < vd->vdev_children; c++)
     3654 +                cnt += vdev_count_leaf_vdevs(vd->vdev_child[c]);
     3655 +
     3656 +        return (cnt);
     3657 +}
     3658 +
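A tiny usage sketch for the leaf counter, again with an invented toy_vdev type: a 3-way mirror counts as three leaves.

/*
 * Sketch of vdev_count_leaf_vdevs() on a made-up tree.
 */
#include <stdio.h>

typedef struct toy_vdev {
        int nchildren;
        struct toy_vdev *children[3];
} toy_vdev_t;

static unsigned
toy_count_leaves(const toy_vdev_t *vd)
{
        unsigned cnt = 0;

        if (vd->nchildren == 0)         /* leaf vdevs count themselves */
                return (1);
        for (int c = 0; c < vd->nchildren; c++)
                cnt += toy_count_leaves(vd->children[c]);
        return (cnt);
}

int
main(void)
{
        toy_vdev_t d0 = { 0, { 0 } }, d1 = { 0, { 0 } }, d2 = { 0, { 0 } };
        toy_vdev_t mirror = { 3, { &d0, &d1, &d2 } };

        printf("leaves: %u\n", toy_count_leaves(&mirror)); /* prints 3 */
        return (0);
}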
     3659 +/*
     3660 + * Implements the per-vdev portion of manual TRIM. The function passes over
     3661 + * all metaslabs on this vdev and performs a metaslab_trim_all on them. It's
     3662 + * also responsible for rate-control if spa_man_trim_rate is non-zero.
     3663 + */
     3664 +void
     3665 +vdev_man_trim(vdev_trim_info_t *vti)
     3666 +{
     3667 +        clock_t t = ddi_get_lbolt();
     3668 +        spa_t *spa = vti->vti_vdev->vdev_spa;
     3669 +        vdev_t *vd = vti->vti_vdev;
     3670 +
     3671 +        vd->vdev_man_trimming = B_TRUE;
     3672 +        vd->vdev_trim_prog = 0;
     3673 +
     3674 +        spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
     3675 +        for (uint64_t i = 0; i < vti->vti_vdev->vdev_ms_count &&
     3676 +            !spa->spa_man_trim_stop; i++) {
     3677 +                uint64_t delta;
     3678 +                metaslab_t *msp = vd->vdev_ms[i];
     3679 +                zio_t *trim_io = metaslab_trim_all(msp, &delta);
     3680 +
     3681 +                atomic_add_64(&vd->vdev_trim_prog, msp->ms_size);
     3682 +                spa_config_exit(spa, SCL_STATE_ALL, FTAG);
     3683 +
     3684 +                (void) zio_wait(trim_io);
     3685 +
     3686 +                /* delay loop to handle fixed-rate trimming */
     3687 +                for (;;) {
     3688 +                        uint64_t rate = spa->spa_man_trim_rate;
     3689 +                        uint64_t sleep_delay;
     3690 +
     3691 +                        if (rate == 0) {
     3692 +                                /* No delay, just update 't' and move on. */
     3693 +                                t = ddi_get_lbolt();
     3694 +                                break;
     3695 +                        }
     3696 +
     3697 +                        sleep_delay = (delta * hz) / rate;
     3698 +                        mutex_enter(&spa->spa_man_trim_lock);
     3699 +                        (void) cv_timedwait(&spa->spa_man_trim_update_cv,
     3700 +                            &spa->spa_man_trim_lock, t + sleep_delay);
     3701 +                        mutex_exit(&spa->spa_man_trim_lock);
     3702 +
     3703 +                        /* If interrupted, don't try to relock, get out */
     3704 +                        if (spa->spa_man_trim_stop)
     3705 +                                goto out;
     3706 +
     3707 +                        /* Timeout passed, move on to the next metaslab. */
     3708 +                        if (ddi_get_lbolt() >= t + sleep_delay) {
     3709 +                                t += sleep_delay;
     3710 +                                break;
     3711 +                        }
     3712 +                }
     3713 +                spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
     3714 +        }
     3715 +        spa_config_exit(spa, SCL_STATE_ALL, FTAG);
     3716 +out:
     3717 +        vd->vdev_man_trimming = B_FALSE;
     3718 +        /*
     3719 +         * Ensure we're marked as "completed" even if we've had to stop
     3720 +         * before processing all metaslabs.
     3721 +         */
     3722 +        vd->vdev_trim_prog = vd->vdev_asize;
     3723 +
     3724 +        ASSERT(vti->vti_done_cb != NULL);
     3725 +        vti->vti_done_cb(vti->vti_done_arg);
     3726 +
     3727 +        kmem_free(vti, sizeof (*vti));
     3728 +}
     3729 +
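The rate limiter converts bytes trimmed into a tick budget: sleep_delay = (delta * hz) / rate, where delta is the byte count metaslab_trim_all() reported for this pass and rate is the user-requested bytes-per-second cap. Worked with invented numbers (hz = 100, a 512 MiB pass, a 64 MiB/s cap), the pass must occupy at least 8 seconds:

/*
 * Worked example of the fixed-rate delay; the figures are illustrative.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t hz = 100;                  /* clock ticks per second */
        uint64_t delta = 512ULL << 20;      /* bytes trimmed this pass */
        uint64_t rate = 64ULL << 20;        /* spa_man_trim_rate, bytes/s */

        /* Ticks this pass is allowed to consume: 8 s -> 800 ticks. */
        uint64_t sleep_delay = (delta * hz) / rate;

        printf("sleep_delay = %llu ticks (%.1f s)\n",
            (unsigned long long)sleep_delay,
            (double)sleep_delay / hz);
        return (0);
}

The cv_timedwait() against spa_man_trim_update_cv means a rate change or a stop request wakes the worker early instead of letting it sleep out the full budget.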
     3730 +/*
     3731 + * Runs through all metaslabs on the vdev and does their autotrim processing.
     3732 + */
     3733 +void
     3734 +vdev_auto_trim(vdev_trim_info_t *vti)
     3735 +{
     3736 +        vdev_t *vd = vti->vti_vdev;
     3737 +        spa_t *spa = vd->vdev_spa;
     3738 +        uint64_t txg = vti->vti_txg;
     3739 +
     3740 +        if (vd->vdev_man_trimming)
     3741 +                goto out;
     3742 +
     3743 +        spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
     3744 +        for (uint64_t i = 0; i < vd->vdev_ms_count; i++)
     3745 +                metaslab_auto_trim(vd->vdev_ms[i], txg);
     3746 +        spa_config_exit(spa, SCL_STATE_ALL, FTAG);
     3747 +out:
     3748 +        ASSERT(vti->vti_done_cb != NULL);
     3749 +        vti->vti_done_cb(vti->vti_done_arg);
     3750 +
     3751 +        kmem_free(vti, sizeof (*vti));
3909 3752  }
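Both trim entry points end the same way: invoke vti_done_cb and free the vdev_trim_info_t, so the worker owns the info struct handed to it by whoever dispatched it. A sketch of that handoff pattern with invented toy types; in the kernel the worker would be driven from a taskq rather than called directly.

/*
 * Sketch of the "worker frees the dispatch argument" pattern; the
 * toy names below are illustrative, not the kernel API.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct toy_trim_info {
        void (*done_cb)(void *);
        void *done_arg;
} toy_trim_info_t;

static void
toy_done(void *arg)
{
        printf("trim done: %s\n", (const char *)arg);
}

static void
toy_trim_worker(toy_trim_info_t *ti)
{
        /* ... per-metaslab trim work would happen here ... */
        ti->done_cb(ti->done_arg);
        free(ti);               /* worker owns the info struct */
}

int
main(void)
{
        toy_trim_info_t *ti = malloc(sizeof (*ti));

        ti->done_cb = toy_done;
        ti->done_arg = "tank/vdev0";
        toy_trim_worker(ti);    /* would be a taskq dispatch in-kernel */
        return (0);
}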
    