9700 ZFS resilvered mirror does not balance reads
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
NEX-17931 Getting panic: vfs_mountroot: cannot mount root after split mirror syspool
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9989 Changing volume names can result in double imports and data corruption
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-6855 System fails to boot up after a large number of datasets created
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-8711 backport illumos 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-7550 zpool remove mirrored slog or special vdev causes system panic due to a NULL pointer dereference in "zfs" module
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6884 KRRP: replication deadlock due to unavailable resources
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6000 zpool destroy/export with autotrim=on panics due to lock assertion
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5702 Special vdev cannot be removed if it was used as slog
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5637 enablespecial property should be disabled after special vdev removal
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5068 In-progress scrub can drastically increase zpool import times
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-5219 WBC: Add capability to delay migration
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5078 Want ability to see progress of freeing data and how much is left to free after large file delete patch
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5019 wrcache activation races vs. 'zpool create -O wrc_mode='
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4934 Add capability to remove special vdev
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4830 writecache=off leaks data on special vdev (the data will never migrate)
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4876 On-demand TRIM shouldn't use system_taskq and should queue jobs
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4679 Autotrim taskq doesn't get destroyed on pool export
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
NEX-4567 KRRP: L2L replication inside of one pool causes ARC-deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
6529 Properly handle updates of variably-sized SA entries.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6175 sdev can create bogus zvol directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6174 /dev/zvol does not show pool directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for 'allow to use write back cache per tree of datasets'
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
6046 SPARC boot should support com.delphix:hole_birth
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6041 SPARC boot should support LZ4
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6044 SPARC zfs reader is using wrong size for objset_phys
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
backout 5997: breaks "zpool add"
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
5770 Add load_nvlist() error handling
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Revert "NEX-4476 WRC: Allow to use write back cache per tree of datasets"
This reverts commit fe97b74444278a6f36fec93179133641296312da.
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-4077 taskq_dispatch in on-demand TRIM can sometimes fail
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Revert "NEX-3965 System may panic on the importing of pool with WRC"
This reverts commit 45bc50222913cddafde94621d28b78d6efaea897.
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3817 'zpool add' of special devices causes system panic
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3213 need to load vdev props for all vdev including spares and l2arc vdevs
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-2112 `zdb -e <pool>` assertion failed for thread 0xfffffd7fff172a40
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-1228 Panic importing pool with active unsupported features
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Harold Shaw <harold.shaw@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-140 Duplicate entries in mantools and doctools manifests
NEX-1078 Replaced ASSERT with if-statement
NEX-521 Single threaded rpcbind is not scalable
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
NEX-1088 partially rolled back 641841bb
to fix a regression that caused an assert in read-only import.
OS-115 Heap leaks related to OS-114 and SUP-577
SUP-577 deadlock between zpool detach and syseventd
OS-103 handle CoS descriptor persistent references across vdev operations
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Make special vdev subtree topology the same as regular vdev subtree to simplify testcase setup
Fixup merge issues
Fix default properties' values after export/import
zfsxx issue #11: support for spare device groups
Issue #34: Add feature flag for the compound checksum - sha1crc32
           Contributors: Boris Protopopov
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
Issue #27: Auto best-effort dedup enable/disable - settable per pool
Issue #7: Reconcile L2ARC and "special" use by datasets
Issue #9: Support for persistent CoS/vdev attributes with feature flags
          Support for feature flags for special tier
          Contributors: Daniil Lunev, Boris Protopopov
Issue #2: optimize DDE lookup in DDT objects
Added an option to control the number of classes of DDEs in the DDT.
The new default is one, i.e. all DDEs are stored together
regardless of refcount.
Issue #3: Add support for parametrized number of copies for DDTs
Issue #25: Add a pool-level property that controls the number of copies of DDTs in the pool.
Fixup merge results
re #13850 Refactor ZFS config discovery IOCs to libzfs_core patterns
re #13748 added zpool export -c option
The zpool export -c command exports the specified pool while keeping its latest
configuration in the cache file for a subsequent zpool import -c.
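A minimal usage sketch of that workflow (the pool name 'tank' and the
cachefile path are assumptions for illustration, not taken from the commit):
    zpool export -c tank                        # export, but keep tank's config in the cache file
    zpool import -c /etc/zfs/zpool.cache tank   # re-import using the cached config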
re #13333 rb4362 - eliminated spa_update_iotime() to fix the stats
re #12684 rb4206 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

--- old/usr/src/uts/common/fs/zfs/spa.c
+++ new/usr/src/uts/common/fs/zfs/spa.c
[... 13 lines elided ...]
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24      - * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
  25      - * Copyright (c) 2015, Nexenta Systems, Inc.  All rights reserved.
       24 + * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  26   25   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
       26 + * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  27   27   * Copyright 2013 Saso Kiselkov. All rights reserved.
  28   28   * Copyright (c) 2014 Integros [integros.com]
  29   29   * Copyright 2016 Toomas Soome <tsoome@me.com>
  30      - * Copyright 2017 Joyent, Inc.
       30 + * Copyright 2018 Joyent, Inc.
  31   31   * Copyright (c) 2017 Datto Inc.
  32      - * Copyright 2018 OmniOS Community Edition (OmniOSce) Association.
  33   32   */
  34   33  
  35   34  /*
  36   35   * SPA: Storage Pool Allocator
  37   36   *
  38   37   * This file contains all the routines used when modifying on-disk SPA state.
  39   38   * This includes opening, importing, destroying, exporting a pool, and syncing a
  40   39   * pool.
  41   40   */
  42   41  
[... 1 line elided ...]
  44   43  #include <sys/fm/fs/zfs.h>
  45   44  #include <sys/spa_impl.h>
  46   45  #include <sys/zio.h>
  47   46  #include <sys/zio_checksum.h>
  48   47  #include <sys/dmu.h>
  49   48  #include <sys/dmu_tx.h>
  50   49  #include <sys/zap.h>
  51   50  #include <sys/zil.h>
  52   51  #include <sys/ddt.h>
  53   52  #include <sys/vdev_impl.h>
  54      -#include <sys/vdev_removal.h>
  55      -#include <sys/vdev_indirect_mapping.h>
  56      -#include <sys/vdev_indirect_births.h>
  57   53  #include <sys/metaslab.h>
  58   54  #include <sys/metaslab_impl.h>
  59   55  #include <sys/uberblock_impl.h>
  60   56  #include <sys/txg.h>
  61   57  #include <sys/avl.h>
  62      -#include <sys/bpobj.h>
  63   58  #include <sys/dmu_traverse.h>
  64   59  #include <sys/dmu_objset.h>
  65   60  #include <sys/unique.h>
  66   61  #include <sys/dsl_pool.h>
  67   62  #include <sys/dsl_dataset.h>
  68   63  #include <sys/dsl_dir.h>
  69   64  #include <sys/dsl_prop.h>
  70   65  #include <sys/dsl_synctask.h>
  71   66  #include <sys/fs/zfs.h>
  72   67  #include <sys/arc.h>
  73   68  #include <sys/callb.h>
  74   69  #include <sys/systeminfo.h>
  75   70  #include <sys/spa_boot.h>
  76   71  #include <sys/zfs_ioctl.h>
  77   72  #include <sys/dsl_scan.h>
  78   73  #include <sys/zfeature.h>
  79   74  #include <sys/dsl_destroy.h>
       75 +#include <sys/cos.h>
       76 +#include <sys/special.h>
       77 +#include <sys/wbc.h>
  80   78  #include <sys/abd.h>
  81   79  
  82   80  #ifdef  _KERNEL
  83   81  #include <sys/bootprops.h>
  84   82  #include <sys/callb.h>
  85   83  #include <sys/cpupart.h>
  86   84  #include <sys/pool.h>
  87   85  #include <sys/sysdc.h>
  88   86  #include <sys/zone.h>
  89   87  #endif  /* _KERNEL */
  90   88  
  91   89  #include "zfs_prop.h"
  92   90  #include "zfs_comutil.h"
  93   91  
  94   92  /*
  95   93   * The interval, in seconds, at which failed configuration cache file writes
  96   94   * should be retried.
  97   95   */
  98      -int zfs_ccw_retry_interval = 300;
       96 +static int zfs_ccw_retry_interval = 300;
  99   97  
 100   98  typedef enum zti_modes {
 101   99          ZTI_MODE_FIXED,                 /* value is # of threads (min 1) */
 102  100          ZTI_MODE_BATCH,                 /* cpu-intensive; value is ignored */
 103  101          ZTI_MODE_NULL,                  /* don't create a taskq */
 104  102          ZTI_NMODES
 105  103  } zti_modes_t;
 106  104  
 107  105  #define ZTI_P(n, q)     { ZTI_MODE_FIXED, (n), (q) }
 108  106  #define ZTI_BATCH       { ZTI_MODE_BATCH, 0, 1 }
[... 32 lines elided ...]
 141  139  const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
 142  140          /* ISSUE        ISSUE_HIGH      INTR            INTR_HIGH */
 143  141          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* NULL */
 144  142          { ZTI_N(8),     ZTI_NULL,       ZTI_P(12, 8),   ZTI_NULL }, /* READ */
 145  143          { ZTI_BATCH,    ZTI_N(5),       ZTI_N(8),       ZTI_N(5) }, /* WRITE */
 146  144          { ZTI_P(12, 8), ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* FREE */
 147  145          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* CLAIM */
 148  146          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* IOCTL */
 149  147  };
 150  148  
      149 +static sysevent_t *spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl,
      150 +    const char *name);
      151 +static void spa_event_notify_impl(sysevent_t *ev);
 151  152  static void spa_sync_version(void *arg, dmu_tx_t *tx);
 152  153  static void spa_sync_props(void *arg, dmu_tx_t *tx);
      154 +static void spa_vdev_sync_props(void *arg, dmu_tx_t *tx);
      155 +static int spa_vdev_prop_set_nosync(vdev_t *, nvlist_t *, boolean_t *);
 153  156  static boolean_t spa_has_active_shared_spare(spa_t *spa);
 154      -static int spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
 155      -    boolean_t reloading);
      157 +static int spa_load_impl(spa_t *spa, uint64_t, nvlist_t *config,
      158 +    spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
      159 +    char **ereport);
 156  160  static void spa_vdev_resilver_done(spa_t *spa);
      161 +static void spa_auto_trim(spa_t *spa, uint64_t txg);
      162 +static void spa_vdev_man_trim_done(spa_t *spa);
      163 +static void spa_vdev_auto_trim_done(spa_t *spa);
      164 +static uint64_t spa_min_trim_rate(spa_t *spa);
 157  165  
 158  166  uint_t          zio_taskq_batch_pct = 75;       /* 1 thread per cpu in pset */
 159  167  id_t            zio_taskq_psrset_bind = PS_NONE;
 160  168  boolean_t       zio_taskq_sysdc = B_TRUE;       /* use SDC scheduling class */
 161  169  uint_t          zio_taskq_basedc = 80;          /* base duty cycle */
 162  170  
 163  171  boolean_t       spa_create_process = B_TRUE;    /* no process ==> no sysdc */
 164  172  extern int      zfs_sync_pass_deferred_free;
 165  173  
 166  174  /*
 167      - * Report any spa_load_verify errors found, but do not fail spa_load.
 168      - * This is used by zdb to analyze non-idle pools.
 169      - */
 170      -boolean_t       spa_load_verify_dryrun = B_FALSE;
 171      -
 172      -/*
 173      - * This (illegal) pool name is used when temporarily importing a spa_t in order
 174      - * to get the vdev stats associated with the imported devices.
 175      - */
 176      -#define TRYIMPORT_NAME  "$import"
 177      -
 178      -/*
 179      - * For debugging purposes: print out vdev tree during pool import.
 180      - */
 181      -boolean_t       spa_load_print_vdev_tree = B_FALSE;
 182      -
 183      -/*
 184      - * A non-zero value for zfs_max_missing_tvds means that we allow importing
 185      - * pools with missing top-level vdevs. This is strictly intended for advanced
 186      - * pool recovery cases since missing data is almost inevitable. Pools with
 187      - * missing devices can only be imported read-only for safety reasons, and their
 188      - * fail-mode will be automatically set to "continue".
 189      - *
 190      - * With 1 missing vdev we should be able to import the pool and mount all
 191      - * datasets. User data that was not modified after the missing device has been
 192      - * added should be recoverable. This means that snapshots created prior to the
 193      - * addition of that device should be completely intact.
 194      - *
 195      - * With 2 missing vdevs, some datasets may fail to mount since there are
 196      - * dataset statistics that are stored as regular metadata. Some data might be
 197      - * recoverable if those vdevs were added recently.
 198      - *
 199      - * With 3 or more missing vdevs, the pool is severely damaged and MOS entries
 200      - * may be missing entirely. Chances of data recovery are very low. Note that
 201      - * there are also risks of performing an inadvertent rewind as we might be
 202      - * missing all the vdevs with the latest uberblocks.
 203      - */
 204      -uint64_t        zfs_max_missing_tvds = 0;
 205      -
 206      -/*
 207      - * The parameters below are similar to zfs_max_missing_tvds but are only
 208      - * intended for a preliminary open of the pool with an untrusted config which
 209      - * might be incomplete or out-dated.
 210      - *
 211      - * We are more tolerant for pools opened from a cachefile since we could have
 212      - * an out-dated cachefile where a device removal was not registered.
 213      - * We could have set the limit arbitrarily high but in the case where devices
 214      - * are really missing we would want to return the proper error codes; we chose
 215      - * SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available
 216      - * and we get a chance to retrieve the trusted config.
 217      - */
 218      -uint64_t        zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1;
 219      -/*
 220      - * In the case where config was assembled by scanning device paths (/dev/dsks
 221      - * by default) we are less tolerant since all the existing devices should have
 222      - * been detected and we want spa_load to return the right error codes.
 223      - */
 224      -uint64_t        zfs_max_missing_tvds_scan = 0;
 225      -
 226      -/*
 227  175   * ==========================================================================
 228  176   * SPA properties routines
 229  177   * ==========================================================================
 230  178   */
 231  179  
 232  180  /*
 233  181   * Add a (source=src, propname=propval) list to an nvlist.
 234  182   */
 235  183  static void
 236  184  spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, char *strval,
[... 15 lines elided ...]
 252  200  }
 253  201  
 254  202  /*
 255  203   * Get property values from the spa configuration.
 256  204   */
 257  205  static void
 258  206  spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
 259  207  {
 260  208          vdev_t *rvd = spa->spa_root_vdev;
 261  209          dsl_pool_t *pool = spa->spa_dsl_pool;
      210 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
 262  211          uint64_t size, alloc, cap, version;
 263  212          zprop_source_t src = ZPROP_SRC_NONE;
 264  213          spa_config_dirent_t *dp;
 265  214          metaslab_class_t *mc = spa_normal_class(spa);
 266  215  
 267  216          ASSERT(MUTEX_HELD(&spa->spa_props_lock));
 268  217  
 269  218          if (rvd != NULL) {
 270  219                  alloc = metaslab_class_get_alloc(spa_normal_class(spa));
 271  220                  size = metaslab_class_get_space(spa_normal_class(spa));
 272  221                  spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
 273  222                  spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
 274  223                  spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
 275  224                  spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
 276  225                      size - alloc, src);
      226 +                spa_prop_add_list(*nvp, ZPOOL_PROP_ENABLESPECIAL, NULL,
      227 +                    (uint64_t)spa->spa_usesc, src);
      228 +                spa_prop_add_list(*nvp, ZPOOL_PROP_MINWATERMARK, NULL,
      229 +                    spa->spa_minwat, src);
      230 +                spa_prop_add_list(*nvp, ZPOOL_PROP_HIWATERMARK, NULL,
      231 +                    spa->spa_hiwat, src);
      232 +                spa_prop_add_list(*nvp, ZPOOL_PROP_LOWATERMARK, NULL,
      233 +                    spa->spa_lowat, src);
      234 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPMETA_DITTO, NULL,
      235 +                    spa->spa_ddt_meta_copies, src);
 277  236  
      237 +                spa_prop_add_list(*nvp, ZPOOL_PROP_META_PLACEMENT, NULL,
      238 +                    mp->spa_enable_meta_placement_selection, src);
      239 +                spa_prop_add_list(*nvp, ZPOOL_PROP_SYNC_TO_SPECIAL, NULL,
      240 +                    mp->spa_sync_to_special, src);
      241 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_META_TO_METADEV, NULL,
      242 +                    mp->spa_ddt_meta_to_special, src);
      243 +                spa_prop_add_list(*nvp, ZPOOL_PROP_ZFS_META_TO_METADEV,
      244 +                    NULL, mp->spa_zfs_meta_to_special, src);
      245 +                spa_prop_add_list(*nvp, ZPOOL_PROP_SMALL_DATA_TO_METADEV, NULL,
      246 +                    mp->spa_small_data_to_special, src);
      247 +
 278  248                  spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
 279  249                      metaslab_class_fragmentation(mc), src);
 280  250                  spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
 281  251                      metaslab_class_expandable_space(mc), src);
 282  252                  spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
 283  253                      (spa_mode(spa) == FREAD), src);
 284  254  
      255 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_DESEGREGATION, NULL,
      256 +                    (spa->spa_ddt_class_min == spa->spa_ddt_class_max), src);
      257 +
 285  258                  cap = (size == 0) ? 0 : (alloc * 100 / size);
 286  259                  spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
 287  260  
      261 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_BEST_EFFORT, NULL,
      262 +                    spa->spa_dedup_best_effort, src);
      263 +
      264 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, NULL,
      265 +                    spa->spa_dedup_lo_best_effort, src);
      266 +
      267 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, NULL,
      268 +                    spa->spa_dedup_hi_best_effort, src);
      269 +
 288  270                  spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
 289  271                      ddt_get_pool_dedup_ratio(spa), src);
 290  272  
      273 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDTCAPPED, NULL,
      274 +                    spa->spa_ddt_capped, src);
      275 +
 291  276                  spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
 292  277                      rvd->vdev_state, src);
 293  278  
 294  279                  version = spa_version(spa);
 295  280                  if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
 296  281                          src = ZPROP_SRC_DEFAULT;
 297  282                  else
 298  283                          src = ZPROP_SRC_LOCAL;
 299  284                  spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL, version, src);
 300  285          }
 301  286  
 302  287          if (pool != NULL) {
 303  288                  /*
 304  289                   * The $FREE directory was introduced in SPA_VERSION_DEADLISTS,
 305  290                   * when opening pools before this version freedir will be NULL.
 306  291                   */
 307  292                  if (pool->dp_free_dir != NULL) {
 308  293                          spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
 309      -                            dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes,
      294 +                            dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes +
      295 +                            pool->dp_long_freeing_total,
 310  296                              src);
 311  297                  } else {
 312  298                          spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
 313      -                            NULL, 0, src);
      299 +                            NULL, pool->dp_long_freeing_total, src);
 314  300                  }
 315  301  
 316  302                  if (pool->dp_leak_dir != NULL) {
 317  303                          spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
 318  304                              dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
 319  305                              src);
 320  306                  } else {
 321  307                          spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
 322  308                              NULL, 0, src);
 323  309                  }
[... 59 lines elided ...]
 383  369           * Get properties from the MOS pool property object.
 384  370           */
 385  371          for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
 386  372              (err = zap_cursor_retrieve(&zc, &za)) == 0;
 387  373              zap_cursor_advance(&zc)) {
 388  374                  uint64_t intval = 0;
 389  375                  char *strval = NULL;
 390  376                  zprop_source_t src = ZPROP_SRC_DEFAULT;
 391  377                  zpool_prop_t prop;
 392  378  
 393      -                if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
      379 +                if ((prop = zpool_name_to_prop(za.za_name)) == ZPROP_INVAL)
 394  380                          continue;
 395  381  
 396  382                  switch (za.za_integer_length) {
 397  383                  case 8:
 398  384                          /* integer property */
 399  385                          if (za.za_first_integer !=
 400  386                              zpool_prop_default_numeric(prop))
 401  387                                  src = ZPROP_SRC_LOCAL;
 402  388  
 403  389                          if (prop == ZPOOL_PROP_BOOTFS) {
[... 58 lines elided ...]
 462  448   * Validate the given pool properties nvlist and modify the list
 463  449   * for the property values to be set.
 464  450   */
 465  451  static int
 466  452  spa_prop_validate(spa_t *spa, nvlist_t *props)
 467  453  {
 468  454          nvpair_t *elem;
 469  455          int error = 0, reset_bootfs = 0;
 470  456          uint64_t objnum = 0;
 471  457          boolean_t has_feature = B_FALSE;
      458 +        uint64_t lowat = spa->spa_lowat, hiwat = spa->spa_hiwat,
      459 +            minwat = spa->spa_minwat;
 472  460  
 473  461          elem = NULL;
 474  462          while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
 475  463                  uint64_t intval;
 476  464                  char *strval, *slash, *check, *fname;
 477  465                  const char *propname = nvpair_name(elem);
 478  466                  zpool_prop_t prop = zpool_name_to_prop(propname);
      467 +                spa_feature_t feature;
 479  468  
 480  469                  switch (prop) {
 481      -                case ZPOOL_PROP_INVAL:
      470 +                case ZPROP_INVAL:
 482  471                          if (!zpool_prop_feature(propname)) {
 483  472                                  error = SET_ERROR(EINVAL);
 484  473                                  break;
 485  474                          }
 486  475  
 487  476                          /*
 488  477                           * Sanitize the input.
 489  478                           */
 490  479                          if (nvpair_type(elem) != DATA_TYPE_UINT64) {
 491  480                                  error = SET_ERROR(EINVAL);
[... 4 lines elided ...]
 496  485                                  error = SET_ERROR(EINVAL);
 497  486                                  break;
 498  487                          }
 499  488  
 500  489                          if (intval != 0) {
 501  490                                  error = SET_ERROR(EINVAL);
 502  491                                  break;
 503  492                          }
 504  493  
 505  494                          fname = strchr(propname, '@') + 1;
 506      -                        if (zfeature_lookup_name(fname, NULL) != 0) {
      495 +                        if (zfeature_lookup_name(fname, &feature) != 0) {
 507  496                                  error = SET_ERROR(EINVAL);
 508  497                                  break;
 509  498                          }
 510  499  
      500 +                        if (feature == SPA_FEATURE_WBC &&
      501 +                            !spa_has_special(spa)) {
      502 +                                error = SET_ERROR(ENOTSUP);
      503 +                                break;
      504 +                        }
      505 +
 511  506                          has_feature = B_TRUE;
 512  507                          break;
 513  508  
 514  509                  case ZPOOL_PROP_VERSION:
 515  510                          error = nvpair_value_uint64(elem, &intval);
 516  511                          if (!error &&
 517  512                              (intval < spa_version(spa) ||
 518  513                              intval > SPA_VERSION_BEFORE_FEATURES ||
 519  514                              has_feature))
 520  515                                  error = SET_ERROR(EINVAL);
 521  516                          break;
 522  517  
 523  518                  case ZPOOL_PROP_DELEGATION:
 524  519                  case ZPOOL_PROP_AUTOREPLACE:
 525  520                  case ZPOOL_PROP_LISTSNAPS:
 526  521                  case ZPOOL_PROP_AUTOEXPAND:
      522 +                case ZPOOL_PROP_DEDUP_BEST_EFFORT:
      523 +                case ZPOOL_PROP_DDT_DESEGREGATION:
      524 +                case ZPOOL_PROP_META_PLACEMENT:
      525 +                case ZPOOL_PROP_FORCETRIM:
      526 +                case ZPOOL_PROP_AUTOTRIM:
 527  527                          error = nvpair_value_uint64(elem, &intval);
 528  528                          if (!error && intval > 1)
 529  529                                  error = SET_ERROR(EINVAL);
 530  530                          break;
 531  531  
      532 +                case ZPOOL_PROP_DDT_META_TO_METADEV:
      533 +                case ZPOOL_PROP_ZFS_META_TO_METADEV:
      534 +                        error = nvpair_value_uint64(elem, &intval);
      535 +                        if (!error && intval > META_PLACEMENT_DUAL)
      536 +                                error = SET_ERROR(EINVAL);
      537 +                        break;
      538 +
      539 +                case ZPOOL_PROP_SYNC_TO_SPECIAL:
      540 +                        error = nvpair_value_uint64(elem, &intval);
      541 +                        if (!error && intval > SYNC_TO_SPECIAL_ALWAYS)
      542 +                                error = SET_ERROR(EINVAL);
      543 +                        break;
      544 +
      545 +                case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
      546 +                        error = nvpair_value_uint64(elem, &intval);
      547 +                        if (!error && intval > SPA_MAXBLOCKSIZE)
      548 +                                error = SET_ERROR(EINVAL);
      549 +                        break;
      550 +
 532  551                  case ZPOOL_PROP_BOOTFS:
 533  552                          /*
 534  553                           * If the pool version is less than SPA_VERSION_BOOTFS,
 535  554                           * or the pool is still being created (version == 0),
 536  555                           * the bootfs property cannot be set.
 537  556                           */
 538  557                          if (spa_version(spa) < SPA_VERSION_BOOTFS) {
 539  558                                  error = SET_ERROR(ENOTSUP);
 540  559                                  break;
 541  560                          }
[... 37 lines elided ...]
 579  598                                      &propval)) == 0 &&
 580  599                                      !BOOTFS_COMPRESS_VALID(propval)) {
 581  600                                          error = SET_ERROR(ENOTSUP);
 582  601                                  } else {
 583  602                                          objnum = dmu_objset_id(os);
 584  603                                  }
 585  604                                  dmu_objset_rele(os, FTAG);
 586  605                          }
 587  606                          break;
 588  607  
      608 +                case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
      609 +                        error = nvpair_value_uint64(elem, &intval);
      610 +                        if ((intval < 0) || (intval > 100) ||
      611 +                            (intval >= spa->spa_dedup_hi_best_effort))
      612 +                                error = SET_ERROR(EINVAL);
      613 +                        break;
      614 +
      615 +                case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
      616 +                        error = nvpair_value_uint64(elem, &intval);
      617 +                        if ((intval < 0) || (intval > 100) ||
      618 +                            (intval <= spa->spa_dedup_lo_best_effort))
      619 +                                error = SET_ERROR(EINVAL);
      620 +                        break;
      621 +
 589  622                  case ZPOOL_PROP_FAILUREMODE:
 590  623                          error = nvpair_value_uint64(elem, &intval);
 591  624                          if (!error && (intval < ZIO_FAILURE_MODE_WAIT ||
 592  625                              intval > ZIO_FAILURE_MODE_PANIC))
 593  626                                  error = SET_ERROR(EINVAL);
 594  627  
 595  628                          /*
 596  629                           * This is a special case which only occurs when
 597  630                           * the pool has completely failed. This allows
 598  631                           * the user to change the in-core failmode property
[... 41 lines elided ...]
 640  673                                   * check.  For this kernel check, we merely
 641  674                                   * check ASCII apart from DEL.  Fix this if
 642  675                                   * there is an easy-to-use kernel isprint().
 643  676                                   */
 644  677                                  if (*check >= 0x7f) {
 645  678                                          error = SET_ERROR(EINVAL);
 646  679                                          break;
 647  680                                  }
 648  681                          }
 649  682                          if (strlen(strval) > ZPROP_MAX_COMMENT)
 650      -                                error = E2BIG;
      683 +                                error = SET_ERROR(E2BIG);
 651  684                          break;
 652  685  
 653  686                  case ZPOOL_PROP_DEDUPDITTO:
 654  687                          if (spa_version(spa) < SPA_VERSION_DEDUP)
 655  688                                  error = SET_ERROR(ENOTSUP);
 656  689                          else
 657  690                                  error = nvpair_value_uint64(elem, &intval);
 658  691                          if (error == 0 &&
 659  692                              intval != 0 && intval < ZIO_DEDUPDITTO_MIN)
 660  693                                  error = SET_ERROR(EINVAL);
 661  694                          break;
      695 +
      696 +                case ZPOOL_PROP_MINWATERMARK:
      697 +                        error = nvpair_value_uint64(elem, &intval);
      698 +                        if (!error && (intval > 100))
      699 +                                error = SET_ERROR(EINVAL);
      700 +                        minwat = intval;
      701 +                        break;
      702 +                case ZPOOL_PROP_LOWATERMARK:
      703 +                        error = nvpair_value_uint64(elem, &intval);
      704 +                        if (!error && (intval > 100))
      705 +                                error = SET_ERROR(EINVAL);
      706 +                        lowat = intval;
      707 +                        break;
      708 +                case ZPOOL_PROP_HIWATERMARK:
      709 +                        error = nvpair_value_uint64(elem, &intval);
      710 +                        if (!error && (intval > 100))
      711 +                                error = SET_ERROR(EINVAL);
      712 +                        hiwat = intval;
      713 +                        break;
      714 +                case ZPOOL_PROP_DEDUPMETA_DITTO:
      715 +                        error = nvpair_value_uint64(elem, &intval);
      716 +                        if (!error && (intval > SPA_DVAS_PER_BP))
      717 +                                error = SET_ERROR(EINVAL);
      718 +                        break;
      719 +                case ZPOOL_PROP_SCRUB_PRIO:
      720 +                case ZPOOL_PROP_RESILVER_PRIO:
      721 +                        error = nvpair_value_uint64(elem, &intval);
      722 +                        if (error || intval > 100)
      723 +                                error = SET_ERROR(EINVAL);
      724 +                        break;
 662  725                  }
 663  726  
 664  727                  if (error)
 665  728                          break;
 666  729          }
 667  730  
      731 +        /* check if low watermark is less than high watermark */
      732 +        if (lowat != 0 && lowat >= hiwat)
      733 +                error = SET_ERROR(EINVAL);
      734 +
      735 +        /* check if min watermark is less than low watermark */
      736 +        if (minwat != 0 && minwat >= lowat)
      737 +                error = SET_ERROR(EINVAL);
      738 +
 668  739          if (!error && reset_bootfs) {
 669  740                  error = nvlist_remove(props,
 670  741                      zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
 671  742  
 672  743                  if (!error) {
 673  744                          error = nvlist_add_uint64(props,
 674  745                              zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
 675  746                  }
 676  747          }
 677  748  
[... 36 lines elided ...]
 714  785                  return (error);
 715  786  
 716  787          while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
 717  788                  zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
 718  789  
 719  790                  if (prop == ZPOOL_PROP_CACHEFILE ||
 720  791                      prop == ZPOOL_PROP_ALTROOT ||
 721  792                      prop == ZPOOL_PROP_READONLY)
 722  793                          continue;
 723  794  
 724      -                if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
      795 +                if (prop == ZPOOL_PROP_VERSION || prop == ZPROP_INVAL) {
 725  796                          uint64_t ver;
 726  797  
 727  798                          if (prop == ZPOOL_PROP_VERSION) {
 728  799                                  VERIFY(nvpair_value_uint64(elem, &ver) == 0);
 729  800                          } else {
 730  801                                  ASSERT(zpool_prop_feature(nvpair_name(elem)));
 731  802                                  ver = SPA_VERSION_FEATURES;
 732  803                                  need_sync = B_TRUE;
 733  804                          }
 734  805  
[... 98 lines elided ...]
 833  904          uint64_t guid;
 834  905  
 835  906          mutex_enter(&spa->spa_vdev_top_lock);
 836  907          mutex_enter(&spa_namespace_lock);
 837  908          guid = spa_generate_guid(NULL);
 838  909  
 839  910          error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
 840  911              spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
 841  912  
 842  913          if (error == 0) {
 843      -                spa_write_cachefile(spa, B_FALSE, B_TRUE);
      914 +                spa_config_sync(spa, B_FALSE, B_TRUE);
 844  915                  spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
 845  916          }
 846  917  
 847  918          mutex_exit(&spa_namespace_lock);
 848  919          mutex_exit(&spa->spa_vdev_top_lock);
 849  920  
 850  921          return (error);
 851  922  }
 852  923  
 853  924  /*
[... 247 lines elided ...]
1101 1172  static void
1102 1173  spa_activate(spa_t *spa, int mode)
1103 1174  {
1104 1175          ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
1105 1176  
1106 1177          spa->spa_state = POOL_STATE_ACTIVE;
1107 1178          spa->spa_mode = mode;
1108 1179  
1109 1180          spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
1110 1181          spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);
     1182 +        spa->spa_special_class = metaslab_class_create(spa, zfs_metaslab_ops);
1111 1183  
1112 1184          /* Try to create a covering process */
1113 1185          mutex_enter(&spa->spa_proc_lock);
1114 1186          ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
1115 1187          ASSERT(spa->spa_proc == &p0);
1116 1188          spa->spa_did = 0;
1117 1189  
1118 1190          /* Only create a process if we're going to be around a while. */
1119 1191          if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
1120 1192                  if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
[... 14 lines elided ...]
1135 1207  #endif
1136 1208                  }
1137 1209          }
1138 1210          mutex_exit(&spa->spa_proc_lock);
1139 1211  
1140 1212          /* If we didn't create a process, we need to create our taskqs. */
1141 1213          if (spa->spa_proc == &p0) {
1142 1214                  spa_create_zio_taskqs(spa);
1143 1215          }
1144 1216  
1145      -        for (size_t i = 0; i < TXG_SIZE; i++)
1146      -                spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0);
1147      -
1148 1217          list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
1149 1218              offsetof(vdev_t, vdev_config_dirty_node));
1150 1219          list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
1151 1220              offsetof(objset_t, os_evicting_node));
1152 1221          list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
1153 1222              offsetof(vdev_t, vdev_state_dirty_node));
1154 1223  
1155 1224          txg_list_create(&spa->spa_vdev_txg_list, spa,
1156 1225              offsetof(struct vdev, vdev_txg_node));
1157 1226  
[... 24 lines elided ...]
1182 1251          list_destroy(&spa->spa_config_dirty_list);
1183 1252          list_destroy(&spa->spa_evicting_os_list);
1184 1253          list_destroy(&spa->spa_state_dirty_list);
1185 1254  
1186 1255          for (int t = 0; t < ZIO_TYPES; t++) {
1187 1256                  for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1188 1257                          spa_taskqs_fini(spa, t, q);
1189 1258                  }
1190 1259          }
1191 1260  
1192      -        for (size_t i = 0; i < TXG_SIZE; i++) {
1193      -                ASSERT3P(spa->spa_txg_zio[i], !=, NULL);
1194      -                VERIFY0(zio_wait(spa->spa_txg_zio[i]));
1195      -                spa->spa_txg_zio[i] = NULL;
1196      -        }
1197      -
1198 1261          metaslab_class_destroy(spa->spa_normal_class);
1199 1262          spa->spa_normal_class = NULL;
1200 1263  
1201 1264          metaslab_class_destroy(spa->spa_log_class);
1202 1265          spa->spa_log_class = NULL;
1203 1266  
     1267 +        metaslab_class_destroy(spa->spa_special_class);
     1268 +        spa->spa_special_class = NULL;
     1269 +
1204 1270          /*
1205 1271           * If this was part of an import or the open otherwise failed, we may
1206 1272           * still have errors left in the queues.  Empty them just in case.
1207 1273           */
1208 1274          spa_errlog_drain(spa);
1209 1275  
1210 1276          avl_destroy(&spa->spa_errlist_scrub);
1211 1277          avl_destroy(&spa->spa_errlist_last);
1212 1278  
1213 1279          spa->spa_state = POOL_STATE_UNINITIALIZED;
[... 74 lines elided ...]
1288 1354  /*
1289 1355   * Opposite of spa_load().
1290 1356   */
1291 1357  static void
1292 1358  spa_unload(spa_t *spa)
1293 1359  {
1294 1360          int i;
1295 1361  
1296 1362          ASSERT(MUTEX_HELD(&spa_namespace_lock));
1297 1363  
1298      -        spa_load_note(spa, "UNLOADING");
     1364 +        /*
     1365 +         * Stop manual trim before stopping spa sync, because manual trim
     1366 +         * needs to execute a synctask (trim timestamp sync) at the end.
     1367 +         */
     1368 +        mutex_enter(&spa->spa_auto_trim_lock);
     1369 +        mutex_enter(&spa->spa_man_trim_lock);
     1370 +        spa_trim_stop_wait(spa);
     1371 +        mutex_exit(&spa->spa_man_trim_lock);
     1372 +        mutex_exit(&spa->spa_auto_trim_lock);
1299 1373  
1300 1374          /*
1301 1375           * Stop async tasks.
1302 1376           */
1303 1377          spa_async_suspend(spa);
1304 1378  
1305 1379          /*
1306 1380           * Stop syncing.
1307 1381           */
1308 1382          if (spa->spa_sync_on) {
[... 17 lines elided ...]
1326 1400          /*
1327 1401           * Wait for any outstanding async I/O to complete.
1328 1402           */
1329 1403          if (spa->spa_async_zio_root != NULL) {
1330 1404                  for (int i = 0; i < max_ncpus; i++)
1331 1405                          (void) zio_wait(spa->spa_async_zio_root[i]);
1332 1406                  kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
1333 1407                  spa->spa_async_zio_root = NULL;
1334 1408          }
1335 1409  
1336      -        if (spa->spa_vdev_removal != NULL) {
1337      -                spa_vdev_removal_destroy(spa->spa_vdev_removal);
1338      -                spa->spa_vdev_removal = NULL;
1339      -        }
1340      -
1341      -        if (spa->spa_condense_zthr != NULL) {
1342      -                ASSERT(!zthr_isrunning(spa->spa_condense_zthr));
1343      -                zthr_destroy(spa->spa_condense_zthr);
1344      -                spa->spa_condense_zthr = NULL;
1345      -        }
1346      -
1347      -        spa_condense_fini(spa);
1348      -
1349 1410          bpobj_close(&spa->spa_deferred_bpobj);
1350 1411  
1351 1412          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1352 1413  
1353 1414          /*
     1415 +         * Stop autotrim tasks.
     1416 +         */
     1417 +        mutex_enter(&spa->spa_auto_trim_lock);
     1418 +        if (spa->spa_auto_trim_taskq)
     1419 +                spa_auto_trim_taskq_destroy(spa);
     1420 +        mutex_exit(&spa->spa_auto_trim_lock);
     1421 +
     1422 +        /*
1354 1423           * Close all vdevs.
1355 1424           */
1356 1425          if (spa->spa_root_vdev)
1357 1426                  vdev_free(spa->spa_root_vdev);
1358 1427          ASSERT(spa->spa_root_vdev == NULL);
1359 1428  
1360 1429          /*
1361 1430           * Close the dsl pool.
1362 1431           */
1363 1432          if (spa->spa_dsl_pool) {
[... 32 lines elided ...]
1396 1465                  spa->spa_l2cache.sav_vdevs = NULL;
1397 1466          }
1398 1467          if (spa->spa_l2cache.sav_config) {
1399 1468                  nvlist_free(spa->spa_l2cache.sav_config);
1400 1469                  spa->spa_l2cache.sav_config = NULL;
1401 1470          }
1402 1471          spa->spa_l2cache.sav_count = 0;
1403 1472  
1404 1473          spa->spa_async_suspended = 0;
1405 1474  
1406      -        spa->spa_indirect_vdevs_loaded = B_FALSE;
1407      -
1408 1475          if (spa->spa_comment != NULL) {
1409 1476                  spa_strfree(spa->spa_comment);
1410 1477                  spa->spa_comment = NULL;
1411 1478          }
1412 1479  
1413 1480          spa_config_exit(spa, SCL_ALL, FTAG);
1414 1481  }
1415 1482  
1416 1483  /*
1417 1484   * Load (or re-load) the current list of vdevs describing the active spares for
1418 1485   * this pool.  When this is called, we have some form of basic information in
1419 1486   * 'spa_spares.sav_config'.  We parse this into vdevs, try to open them, and
1420 1487   * then re-generate a more complete list including status information.
1421 1488   */
1422      -void
     1489 +static void
1423 1490  spa_load_spares(spa_t *spa)
1424 1491  {
1425 1492          nvlist_t **spares;
1426 1493          uint_t nspares;
1427 1494          int i;
1428 1495          vdev_t *vd, *tvd;
1429 1496  
1430 1497          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1431 1498  
1432 1499          /*
... 96 lines elided ...
1529 1596  }
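
spa_load_spares() follows the parse/probe/re-generate shape its comment describes. A userland sketch of that shape with libnvpair (link with -lnvpair); the "spares", "path", and "healthy" keys are invented for the example, not ZPOOL_CONFIG names:

    #include <assert.h>
    #include <stdlib.h>
    #include <libnvpair.h>

    /* Pretend to "open" a spare: healthy iff it carries a path. */
    static boolean_t
    probe_spare(nvlist_t *spare)
    {
            char *path;
            return (nvlist_lookup_string(spare, "path", &path) == 0);
    }

    /* Parse the basic list, probe each entry, and re-generate a more
     * complete list that carries status information. */
    static nvlist_t *
    regen_spares(nvlist_t *sav_config)
    {
            nvlist_t **spares, **full, *out;
            uint_t nspares;

            assert(nvlist_lookup_nvlist_array(sav_config, "spares",
                &spares, &nspares) == 0);

            full = calloc(nspares, sizeof (nvlist_t *));
            for (uint_t i = 0; i < nspares; i++) {
                    assert(nvlist_dup(spares[i], &full[i], 0) == 0);
                    assert(nvlist_add_boolean_value(full[i], "healthy",
                        probe_spare(spares[i])) == 0);
            }

            assert(nvlist_alloc(&out, NV_UNIQUE_NAME, 0) == 0);
            assert(nvlist_add_nvlist_array(out, "spares", full,
                nspares) == 0);

            for (uint_t i = 0; i < nspares; i++)
                    nvlist_free(full[i]);
            free(full);
            return (out);
    }
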
1530 1597  
1531 1598  /*
1532 1599   * Load (or re-load) the current list of vdevs describing the active l2cache for
1533 1600   * this pool.  When this is called, we have some form of basic information in
1534 1601   * 'spa_l2cache.sav_config'.  We parse this into vdevs, try to open them, and
1535 1602   * then re-generate a more complete list including status information.
1536 1603   * Devices which are already active have their details maintained, and are
1537 1604   * not re-opened.
1538 1605   */
1539      -void
     1606 +static void
1540 1607  spa_load_l2cache(spa_t *spa)
1541 1608  {
1542 1609          nvlist_t **l2cache;
1543 1610          uint_t nl2cache;
1544 1611          int i, j, oldnvdevs;
1545 1612          uint64_t guid;
1546 1613          vdev_t *vd, **oldvdevs, **newvdevs;
1547 1614          spa_aux_vdev_t *sav = &spa->spa_l2cache;
1548 1615  
1549 1616          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
... 50 lines elided ...
1600 1667                          vd->vdev_top = vd;
1601 1668                          vd->vdev_aux = sav;
1602 1669  
1603 1670                          spa_l2cache_activate(vd);
1604 1671  
1605 1672                          if (vdev_open(vd) != 0)
1606 1673                                  continue;
1607 1674  
1608 1675                          (void) vdev_validate_aux(vd);
1609 1676  
1610      -                        if (!vdev_is_dead(vd))
1611      -                                l2arc_add_vdev(spa, vd);
     1677 +                        if (!vdev_is_dead(vd)) {
     1678 +                                boolean_t do_rebuild = B_FALSE;
     1679 +
     1680 +                                (void) nvlist_lookup_boolean_value(l2cache[i],
     1681 +                                    ZPOOL_CONFIG_L2CACHE_PERSISTENT,
     1682 +                                    &do_rebuild);
     1683 +                                l2arc_add_vdev(spa, vd, do_rebuild);
     1684 +                        }
1612 1685                  }
1613 1686          }
1614 1687  
1615 1688          /*
1616 1689           * Purge vdevs that were dropped
1617 1690           */
1618 1691          for (i = 0; i < oldnvdevs; i++) {
1619 1692                  uint64_t pool;
1620 1693  
1621 1694                  vd = oldvdevs[i];
... 57 lines elided ...
1679 1752          error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
1680 1753              DMU_READ_PREFETCH);
1681 1754          if (error == 0)
1682 1755                  error = nvlist_unpack(packed, nvsize, value, 0);
1683 1756          kmem_free(packed, nvsize);
1684 1757  
1685 1758          return (error);
1686 1759  }
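
load_nvlist() above is one half of a pack/unpack round trip, and the l2cache hunk earlier reads an optional boolean (ZPOOL_CONFIG_L2CACHE_PERSISTENT) that simply keeps its default when the key is absent. Both shapes in one userland sketch (libnvpair, -lnvpair; key names invented):

    #include <assert.h>
    #include <stdlib.h>
    #include <libnvpair.h>

    int
    main(void)
    {
            nvlist_t *nvl, *copy;
            char *packed = NULL;
            size_t size = 0;
            boolean_t rebuild = B_FALSE;    /* default if key is absent */

            assert(nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0) == 0);
            assert(nvlist_add_uint64(nvl, "demo", 42) == 0);

            /* nvlist_pack() allocates the buffer when *packed == NULL. */
            assert(nvlist_pack(nvl, &packed, &size, NV_ENCODE_XDR, 0) == 0);
            assert(nvlist_unpack(packed, size, &copy, 0) == 0);

            /* Optional key: a failed lookup leaves the default in place,
             * exactly the shape of the do_rebuild lookup above. */
            (void) nvlist_lookup_boolean_value(copy, "persistent", &rebuild);

            free(packed);
            nvlist_free(nvl);
            nvlist_free(copy);
            return (rebuild ? 1 : 0);       /* exit code just uses the value */
    }
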
1687 1760  
1688 1761  /*
1689      - * Concrete top-level vdevs that are not missing and are not logs. At every
1690      - * spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds.
1691      - */
1692      -static uint64_t
1693      -spa_healthy_core_tvds(spa_t *spa)
1694      -{
1695      -        vdev_t *rvd = spa->spa_root_vdev;
1696      -        uint64_t tvds = 0;
1697      -
1698      -        for (uint64_t i = 0; i < rvd->vdev_children; i++) {
1699      -                vdev_t *vd = rvd->vdev_child[i];
1700      -                if (vd->vdev_islog)
1701      -                        continue;
1702      -                if (vdev_is_concrete(vd) && !vdev_is_dead(vd))
1703      -                        tvds++;
1704      -        }
1705      -
1706      -        return (tvds);
1707      -}
1708      -
1709      -/*
1710 1762   * Checks to see if the given vdev could not be opened, in which case we post a
1711 1763   * sysevent to notify the autoreplace code that the device has been removed.
1712 1764   */
1713 1765  static void
1714 1766  spa_check_removed(vdev_t *vd)
1715 1767  {
1716      -        for (uint64_t c = 0; c < vd->vdev_children; c++)
     1768 +        for (int c = 0; c < vd->vdev_children; c++)
1717 1769                  spa_check_removed(vd->vdev_child[c]);
1718 1770  
1719 1771          if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
1720      -            vdev_is_concrete(vd)) {
     1772 +            !vd->vdev_ishole) {
1721 1773                  zfs_post_autoreplace(vd->vdev_spa, vd);
1722 1774                  spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
1723 1775          }
1724 1776  }
1725 1777  
1726      -static int
1727      -spa_check_for_missing_logs(spa_t *spa)
     1778 +static void
     1779 +spa_config_valid_zaps(vdev_t *vd, vdev_t *mvd)
1728 1780  {
1729      -        vdev_t *rvd = spa->spa_root_vdev;
     1781 +        ASSERT3U(vd->vdev_children, ==, mvd->vdev_children);
1730 1782  
     1783 +        vd->vdev_top_zap = mvd->vdev_top_zap;
     1784 +        vd->vdev_leaf_zap = mvd->vdev_leaf_zap;
     1785 +
     1786 +        for (uint64_t i = 0; i < vd->vdev_children; i++) {
     1787 +                spa_config_valid_zaps(vd->vdev_child[i], mvd->vdev_child[i]);
     1788 +        }
     1789 +}
     1790 +
     1791 +/*
     1792 + * Validate the current config against the MOS config
     1793 + */
     1794 +static boolean_t
     1795 +spa_config_valid(spa_t *spa, nvlist_t *config)
     1796 +{
     1797 +        vdev_t *mrvd, *rvd = spa->spa_root_vdev;
     1798 +        nvlist_t *nv;
     1799 +
     1800 +        VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nv) == 0);
     1801 +
     1802 +        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     1803 +        VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
     1804 +
1731 1805          /*
     1806 +         * One of the earliest signs of a stale config is a mismatch
     1807 +         * in the number of child vdevs.
     1808 +         */
     1809 +        if (rvd->vdev_children != mrvd->vdev_children) {
     1810 +                vdev_free(mrvd);
     1811 +                spa_config_exit(spa, SCL_ALL, FTAG);
     1812 +                return (B_FALSE);
     1813 +        }
     1814 +        /*
1732 1815           * If we're doing a normal import, then build up any additional
1733      -         * diagnostic information about missing log devices.
     1816 +         * diagnostic information about missing devices in this config.
1734 1817           * We'll pass this up to the user for further processing.
1735 1818           */
1736 1819          if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
1737 1820                  nvlist_t **child, *nv;
1738 1821                  uint64_t idx = 0;
1739 1822  
1740 1823                  child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **),
1741 1824                      KM_SLEEP);
1742 1825                  VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
1743 1826  
1744      -                for (uint64_t c = 0; c < rvd->vdev_children; c++) {
     1827 +                for (int c = 0; c < rvd->vdev_children; c++) {
1745 1828                          vdev_t *tvd = rvd->vdev_child[c];
     1829 +                        vdev_t *mtvd = mrvd->vdev_child[c];
1746 1830  
1747      -                        /*
1748      -                         * We consider a device as missing only if it failed
1749      -                         * to open (i.e. offline or faulted is not considered
1750      -                         * as missing).
1751      -                         */
1752      -                        if (tvd->vdev_islog &&
1753      -                            tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1754      -                                child[idx++] = vdev_config_generate(spa, tvd,
1755      -                                    B_FALSE, VDEV_CONFIG_MISSING);
1756      -                        }
     1831 +                        if (tvd->vdev_ops == &vdev_missing_ops &&
     1832 +                            mtvd->vdev_ops != &vdev_missing_ops &&
     1833 +                            mtvd->vdev_islog)
     1834 +                                child[idx++] = vdev_config_generate(spa, mtvd,
     1835 +                                    B_FALSE, 0);
1757 1836                  }
1758 1837  
1759      -                if (idx > 0) {
1760      -                        fnvlist_add_nvlist_array(nv,
1761      -                            ZPOOL_CONFIG_CHILDREN, child, idx);
1762      -                        fnvlist_add_nvlist(spa->spa_load_info,
1763      -                            ZPOOL_CONFIG_MISSING_DEVICES, nv);
     1838 +                if (idx) {
     1839 +                        VERIFY(nvlist_add_nvlist_array(nv,
     1840 +                            ZPOOL_CONFIG_CHILDREN, child, idx) == 0);
     1841 +                        VERIFY(nvlist_add_nvlist(spa->spa_load_info,
     1842 +                            ZPOOL_CONFIG_MISSING_DEVICES, nv) == 0);
1764 1843  
1765      -                        for (uint64_t i = 0; i < idx; i++)
     1844 +                        for (int i = 0; i < idx; i++)
1766 1845                                  nvlist_free(child[i]);
1767 1846                  }
1768 1847                  nvlist_free(nv);
1769 1848                  kmem_free(child, rvd->vdev_children * sizeof (char **));
     1849 +        }
1770 1850  
1771      -                if (idx > 0) {
1772      -                        spa_load_failed(spa, "some log devices are missing");
1773      -                        return (SET_ERROR(ENXIO));
1774      -                }
1775      -        } else {
1776      -                for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1777      -                        vdev_t *tvd = rvd->vdev_child[c];
     1851 +        /*
     1852 +         * Compare the root vdev tree with the information we have
     1853 +         * from the MOS config (mrvd). Check each top-level vdev
     1854 +         * with the corresponding MOS config top-level (mtvd).
     1855 +         */
     1856 +        for (int c = 0; c < rvd->vdev_children; c++) {
     1857 +                vdev_t *tvd = rvd->vdev_child[c];
     1858 +                vdev_t *mtvd = mrvd->vdev_child[c];
1778 1859  
1779      -                        if (tvd->vdev_islog &&
1780      -                            tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
     1860 +                /*
     1861 +                 * Resolve any "missing" vdevs in the current configuration.
     1862 +                 * If we find that the MOS config has more accurate information
     1863 +                 * about the top-level vdev then use that vdev instead.
     1864 +                 */
     1865 +                if (tvd->vdev_ops == &vdev_missing_ops &&
     1866 +                    mtvd->vdev_ops != &vdev_missing_ops) {
     1867 +
     1868 +                        if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG))
     1869 +                                continue;
     1870 +
     1871 +                        /*
     1872 +                         * Device specific actions.
     1873 +                         */
     1874 +                        if (mtvd->vdev_islog) {
1781 1875                                  spa_set_log_state(spa, SPA_LOG_CLEAR);
1782      -                                spa_load_note(spa, "some log devices are "
1783      -                                    "missing, ZIL is dropped.");
1784      -                                break;
     1876 +                        } else {
     1877 +                                /*
     1878 +                                 * XXX - once we have 'readonly' pool
     1879 +                                 * support we should be able to handle
     1880 +                                 * missing data devices by transitioning
     1881 +                                 * the pool to readonly.
     1882 +                                 */
     1883 +                                continue;
1785 1884                          }
     1885 +
     1886 +                        /*
     1887 +                         * Swap the missing vdev with the data we were
     1888 +                         * able to obtain from the MOS config.
     1889 +                         */
     1890 +                        vdev_remove_child(rvd, tvd);
     1891 +                        vdev_remove_child(mrvd, mtvd);
     1892 +
     1893 +                        vdev_add_child(rvd, mtvd);
     1894 +                        vdev_add_child(mrvd, tvd);
     1895 +
     1896 +                        spa_config_exit(spa, SCL_ALL, FTAG);
     1897 +                        vdev_load(mtvd);
     1898 +                        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     1899 +
     1900 +                        vdev_reopen(rvd);
     1901 +                } else {
     1902 +                        if (mtvd->vdev_islog) {
     1903 +                                /*
     1904 +                                 * Load the slog device's state from the MOS
     1905 +                                 * config since it's possible that the label
     1906 +                                 * does not contain the most up-to-date
     1907 +                                 * information.
     1908 +                                 */
     1909 +                                vdev_load_log_state(tvd, mtvd);
     1910 +                                vdev_reopen(tvd);
     1911 +                        }
     1912 +
     1913 +                        /*
     1914 +                         * Per-vdev ZAP info is stored exclusively in the MOS.
     1915 +                         */
     1916 +                        spa_config_valid_zaps(tvd, mtvd);
1786 1917                  }
1787 1918          }
1788 1919  
1789      -        return (0);
     1920 +        vdev_free(mrvd);
     1921 +        spa_config_exit(spa, SCL_ALL, FTAG);
     1922 +
     1923 +        /*
     1924 +         * Ensure we were able to validate the config.
     1925 +         */
     1926 +        return (rvd->vdev_guid_sum == spa->spa_uberblock.ub_guid_sum);
1790 1927  }
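
spa_config_valid() ends by comparing the root vdev's guid sum against the uberblock's. A toy model of why that works: the sum is plain 64-bit modular addition, so any missing (or swapped-in) device perturbs it. Note the real sum is accumulated over the whole vdev tree, not a flat array as here:

    #include <stdint.h>

    static uint64_t
    guid_sum(const uint64_t *guids, int n)
    {
            uint64_t sum = 0;
            for (int i = 0; i < n; i++)
                    sum += guids[i];        /* intentionally wraps mod 2^64 */
            return (sum);
    }

    /* Mirrors the final check above: a mismatch against the sum recorded
     * in the uberblock means the configuration is incomplete. */
    static int
    config_complete(const uint64_t *guids, int n, uint64_t ub_guid_sum)
    {
            return (guid_sum(guids, n) == ub_guid_sum);
    }
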
1791 1928  
1792 1929  /*
1793 1930   * Check for missing log devices
1794 1931   */
1795 1932  static boolean_t
1796 1933  spa_check_logs(spa_t *spa)
1797 1934  {
1798 1935          boolean_t rv = B_FALSE;
1799 1936          dsl_pool_t *dp = spa_get_dsl(spa);
... 45 lines elided ...
1845 1982          for (int c = 0; c < rvd->vdev_children; c++) {
1846 1983                  vdev_t *tvd = rvd->vdev_child[c];
1847 1984                  metaslab_group_t *mg = tvd->vdev_mg;
1848 1985  
1849 1986                  if (tvd->vdev_islog)
1850 1987                          metaslab_group_activate(mg);
1851 1988          }
1852 1989  }
1853 1990  
1854 1991  int
1855      -spa_reset_logs(spa_t *spa)
     1992 +spa_offline_log(spa_t *spa)
1856 1993  {
1857 1994          int error;
1858 1995  
1859      -        error = dmu_objset_find(spa_name(spa), zil_reset,
     1996 +        error = dmu_objset_find(spa_name(spa), zil_vdev_offline,
1860 1997              NULL, DS_FIND_CHILDREN);
1861 1998          if (error == 0) {
1862 1999                  /*
1863 2000                   * We successfully offlined the log device; sync out the
1864 2001                   * current txg so that the "stubby" block can be removed
1865 2002                   * by zil_sync().
1866 2003                   */
1867 2004                  txg_wait_synced(spa->spa_dsl_pool, 0);
1868 2005          }
1869 2006          return (error);
... 29 lines elided ...
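
spa_offline_log() relies on txg_wait_synced() to block until the open txg reaches disk so zil_sync() can drop the stubby block. A minimal condition-variable model of that wait, assuming some syncer thread advances synced_txg (names here are illustrative, not the txg.c API):

    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t txg_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t txg_cv = PTHREAD_COND_INITIALIZER;
    static uint64_t synced_txg;     /* advanced by a syncer thread */

    /* Wait until txg has synced; txg == 0 means, like txg_wait_synced(dp, 0),
     * "the txg that is open right now". */
    void
    wait_synced(uint64_t txg)
    {
            pthread_mutex_lock(&txg_lock);
            if (txg == 0)
                    txg = synced_txg + 1;   /* "current" txg in this model */
            while (synced_txg < txg)
                    pthread_cond_wait(&txg_cv, &txg_lock);
            pthread_mutex_unlock(&txg_lock);
    }

    /* Syncer side: finish a txg and wake all waiters. */
    void
    note_txg_synced(void)
    {
            pthread_mutex_lock(&txg_lock);
            synced_txg++;
            pthread_cond_broadcast(&txg_cv);
            pthread_mutex_unlock(&txg_lock);
    }
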
1899 2036  spa_load_verify_done(zio_t *zio)
1900 2037  {
1901 2038          blkptr_t *bp = zio->io_bp;
1902 2039          spa_load_error_t *sle = zio->io_private;
1903 2040          dmu_object_type_t type = BP_GET_TYPE(bp);
1904 2041          int error = zio->io_error;
1905 2042          spa_t *spa = zio->io_spa;
1906 2043  
1907 2044          abd_free(zio->io_abd);
1908 2045          if (error) {
1909      -                if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
1910      -                    type != DMU_OT_INTENT_LOG)
     2046 +                if (BP_IS_METADATA(bp) && type != DMU_OT_INTENT_LOG)
1911 2047                          atomic_inc_64(&sle->sle_meta_count);
1912 2048                  else
1913 2049                          atomic_inc_64(&sle->sle_data_count);
1914 2050          }
1915 2051  
1916 2052          mutex_enter(&spa->spa_scrub_lock);
1917 2053          spa->spa_scrub_inflight--;
1918 2054          cv_broadcast(&spa->spa_scrub_io_cv);
1919 2055          mutex_exit(&spa->spa_scrub_lock);
1920 2056  }
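
spa_load_verify_done() is the release half of a classic inflight throttle: decrement the counter, then broadcast. The issue half (in spa_load_verify_cb, elided below) waits while the counter sits at its cap. Both halves sketched with pthreads; max_inflight is an illustrative cap, not the ZFS tunable:

    #include <pthread.h>

    static pthread_mutex_t scrub_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t scrub_io_cv = PTHREAD_COND_INITIALIZER;
    static int inflight;
    static const int max_inflight = 32;     /* illustrative cap */

    /* Issue side: block while the pipeline is full, then account the I/O. */
    void
    io_issue(void)
    {
            pthread_mutex_lock(&scrub_lock);
            while (inflight >= max_inflight)
                    pthread_cond_wait(&scrub_io_cv, &scrub_lock);
            inflight++;
            pthread_mutex_unlock(&scrub_lock);
            /* ... start the async I/O here ... */
    }

    /* Done callback: release a slot and wake waiters, mirroring the
     * spa_scrub_inflight-- / cv_broadcast() pair above. */
    void
    io_done(void)
    {
            pthread_mutex_lock(&scrub_lock);
            inflight--;
            pthread_cond_broadcast(&scrub_io_cv);
            pthread_mutex_unlock(&scrub_lock);
    }
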
... 68 lines elided ...
1989 2125              spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
1990 2126              DS_FIND_CHILDREN);
1991 2127          dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
1992 2128          if (error != 0)
1993 2129                  return (error);
1994 2130  
1995 2131          rio = zio_root(spa, NULL, &sle,
1996 2132              ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
1997 2133  
1998 2134          if (spa_load_verify_metadata) {
1999      -                if (spa->spa_extreme_rewind) {
2000      -                        spa_load_note(spa, "performing a complete scan of the "
2001      -                            "pool since extreme rewind is on. This may take "
2002      -                            "a very long time.\n  (spa_load_verify_data=%u, "
2003      -                            "spa_load_verify_metadata=%u)",
2004      -                            spa_load_verify_data, spa_load_verify_metadata);
2005      -                }
2006      -                error = traverse_pool(spa, spa->spa_verify_min_txg,
     2135 +                zbookmark_phys_t zb = { 0 };
     2136 +                error = traverse_pool(spa, spa->spa_verify_min_txg, UINT64_MAX,
2007 2137                      TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA,
2008      -                    spa_load_verify_cb, rio);
     2138 +                    spa_load_verify_cb, rio, &zb);
2009 2139          }
2010 2140  
2011 2141          (void) zio_wait(rio);
2012 2142  
2013 2143          spa->spa_load_meta_errors = sle.sle_meta_count;
2014 2144          spa->spa_load_data_errors = sle.sle_data_count;
2015 2145  
2016      -        if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) {
2017      -                spa_load_note(spa, "spa_load_verify found %llu metadata errors "
2018      -                    "and %llu data errors", (u_longlong_t)sle.sle_meta_count,
2019      -                    (u_longlong_t)sle.sle_data_count);
2020      -        }
2021      -
2022      -        if (spa_load_verify_dryrun ||
2023      -            (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2024      -            sle.sle_data_count <= policy.zrp_maxdata)) {
     2146 +        if (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
     2147 +            sle.sle_data_count <= policy.zrp_maxdata) {
2025 2148                  int64_t loss = 0;
2026 2149  
2027 2150                  verify_ok = B_TRUE;
2028 2151                  spa->spa_load_txg = spa->spa_uberblock.ub_txg;
2029 2152                  spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
2030 2153  
2031 2154                  loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
2032 2155                  VERIFY(nvlist_add_uint64(spa->spa_load_info,
2033 2156                      ZPOOL_CONFIG_LOAD_TIME, spa->spa_load_txg_ts) == 0);
2034 2157                  VERIFY(nvlist_add_int64(spa->spa_load_info,
2035 2158                      ZPOOL_CONFIG_REWIND_TIME, loss) == 0);
2036 2159                  VERIFY(nvlist_add_uint64(spa->spa_load_info,
2037 2160                      ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
2038 2161          } else {
2039 2162                  spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
2040 2163          }
2041 2164  
2042      -        if (spa_load_verify_dryrun)
2043      -                return (0);
2044      -
2045 2165          if (error) {
2046 2166                  if (error != ENXIO && error != EIO)
2047 2167                          error = SET_ERROR(EIO);
2048 2168                  return (error);
2049 2169          }
2050 2170  
2051 2171          return (verify_ok ? 0 : EIO);
2052 2172  }
2053 2173  
2054 2174  /*
... 3 lines elided ...
2058 2178  spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
2059 2179  {
2060 2180          (void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
2061 2181              zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
2062 2182  }
2063 2183  
2064 2184  /*
2065 2185   * Find a value in the pool directory object.
2066 2186   */
2067 2187  static int
2068      -spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent)
     2188 +spa_dir_prop(spa_t *spa, const char *name, uint64_t *val)
2069 2189  {
2070      -        int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2071      -            name, sizeof (uint64_t), 1, val);
     2190 +        return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     2191 +            name, sizeof (uint64_t), 1, val));
     2192 +}
2072 2193  
2073      -        if (error != 0 && (error != ENOENT || log_enoent)) {
2074      -                spa_load_failed(spa, "couldn't get '%s' value in MOS directory "
2075      -                    "[error=%d]", name, error);
     2194 +static void
     2195 +spa_set_ddt_classes(spa_t *spa, int desegregation)
     2196 +{
     2197 +        /*
     2198 +         * If desegregation is turned on, set up ddt_class restrictions.
     2199 +         */
     2200 +        if (desegregation) {
     2201 +                spa->spa_ddt_class_min = DDT_CLASS_DUPLICATE;
     2202 +                spa->spa_ddt_class_max = DDT_CLASS_DUPLICATE;
     2203 +        } else {
     2204 +                spa->spa_ddt_class_min = DDT_CLASS_DITTO;
     2205 +                spa->spa_ddt_class_max = DDT_CLASS_UNIQUE;
2076 2206          }
2077      -
2078      -        return (error);
2079 2207  }
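
spa_set_ddt_classes() narrows the [spa_ddt_class_min, spa_ddt_class_max] window to DDT_CLASS_DUPLICATE when desegregation is on. The enum below matches sys/ddt.h, but how the window is consumed is an assumption for illustration, not the actual ddt.c logic:

    typedef enum ddt_class {
            DDT_CLASS_DITTO = 0,
            DDT_CLASS_DUPLICATE,
            DDT_CLASS_UNIQUE,
            DDT_CLASSES
    } ddt_class_t;

    struct class_window {
            ddt_class_t class_min;
            ddt_class_t class_max;
    };

    /* With desegregation on (min == max == DDT_CLASS_DUPLICATE),
     * only the duplicate class survives this test. */
    static int
    class_allowed(const struct class_window *w, ddt_class_t c)
    {
            return (c >= w->class_min && c <= w->class_max);
    }
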
2080 2208  
2081 2209  static int
2082 2210  spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
2083 2211  {
2084 2212          vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
2085      -        return (SET_ERROR(err));
     2213 +        return (err);
2086 2214  }
2087 2215  
2088      -static void
2089      -spa_spawn_aux_threads(spa_t *spa)
2090      -{
2091      -        ASSERT(spa_writeable(spa));
2092      -
2093      -        ASSERT(MUTEX_HELD(&spa_namespace_lock));
2094      -
2095      -        spa_start_indirect_condensing_thread(spa);
2096      -}
2097      -
2098 2216  /*
2099 2217   * Fix up config after a partly-completed split.  This is done with the
2100 2218   * ZPOOL_CONFIG_SPLIT nvlist.  Both the splitting pool and the split-off
2101 2219   * pool have that entry in their config, but only the splitting one contains
2102 2220   * a list of all the guids of the vdevs that are being split off.
2103 2221   *
2104 2222   * This function determines what to do with that list: either rejoin
2105 2223   * all the disks to the pool, or complete the splitting process.  To attempt
2106 2224   * the rejoin, each disk that is offlined is marked online again, and
2107 2225   * we do a reopen() call.  If the vdev label for every disk that was
... 63 lines elided ...
2171 2289                  for (i = 0; i < gcount; i++)
2172 2290                          if (vd[i] != NULL)
2173 2291                                  vdev_split(vd[i]);
2174 2292                  vdev_reopen(spa->spa_root_vdev);
2175 2293          }
2176 2294  
2177 2295          kmem_free(vd, gcount * sizeof (vdev_t *));
2178 2296  }
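
spa_try_repair()'s rejoin-or-split decision, per the comment above: reopen every offlined disk, and only if all of the labels check out do the disks rejoin the pool; otherwise each one is split off (the vdev_split() loop visible just above). A toy sketch; toy_vdev_t and try_reopen() are hypothetical stand-ins for the per-disk label validation:

    typedef struct toy_vdev { int guid; int label_ok; } toy_vdev_t;

    static int
    try_reopen(toy_vdev_t *vd)
    {
            return (vd->label_ok);  /* 1 if the label still matches */
    }

    static void
    try_repair(toy_vdev_t *vds, int n, void (*rejoin)(toy_vdev_t *),
        void (*split_off)(toy_vdev_t *))
    {
            int all_labels_ok = 1;

            for (int i = 0; i < n; i++)
                    if (!try_reopen(&vds[i]))
                            all_labels_ok = 0;  /* any failure => finish split */

            /* All-or-nothing: either every disk rejoins, or the split
             * completes for every disk. */
            for (int i = 0; i < n; i++)
                    (all_labels_ok ? rejoin : split_off)(&vds[i]);
    }
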
2179 2297  
2180 2298  static int
2181      -spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type)
     2299 +spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type,
     2300 +    boolean_t mosconfig)
2182 2301  {
     2302 +        nvlist_t *config = spa->spa_config;
2183 2303          char *ereport = FM_EREPORT_ZFS_POOL;
     2304 +        char *comment;
2184 2305          int error;
     2306 +        uint64_t pool_guid;
     2307 +        nvlist_t *nvl;
2185 2308  
2186      -        spa->spa_load_state = state;
     2309 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid))
     2310 +                return (SET_ERROR(EINVAL));
2187 2311  
2188      -        gethrestime(&spa->spa_loaded_ts);
2189      -        error = spa_load_impl(spa, type, &ereport, B_FALSE);
     2312 +        ASSERT(spa->spa_comment == NULL);
     2313 +        if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
     2314 +                spa->spa_comment = spa_strdup(comment);
2190 2315  
2191 2316          /*
     2317 +         * Versioning wasn't explicitly added to the label until later, so if
     2318 +         * it's not present treat it as the initial version.
     2319 +         */
     2320 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
     2321 +            &spa->spa_ubsync.ub_version) != 0)
     2322 +                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
     2323 +
     2324 +        (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
     2325 +            &spa->spa_config_txg);
     2326 +
     2327 +        if ((state == SPA_LOAD_IMPORT || state == SPA_LOAD_TRYIMPORT) &&
     2328 +            spa_guid_exists(pool_guid, 0)) {
     2329 +                error = SET_ERROR(EEXIST);
     2330 +        } else {
     2331 +                spa->spa_config_guid = pool_guid;
     2332 +
     2333 +                if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT,
     2334 +                    &nvl) == 0) {
     2335 +                        VERIFY(nvlist_dup(nvl, &spa->spa_config_splitting,
     2336 +                            KM_SLEEP) == 0);
     2337 +                }
     2338 +
     2339 +                nvlist_free(spa->spa_load_info);
     2340 +                spa->spa_load_info = fnvlist_alloc();
     2341 +
     2342 +                gethrestime(&spa->spa_loaded_ts);
     2343 +                error = spa_load_impl(spa, pool_guid, config, state, type,
     2344 +                    mosconfig, &ereport);
     2345 +        }
     2346 +
     2347 +        /*
2192 2348           * Don't count references from objsets that are already closed
2193 2349           * and are making their way through the eviction process.
2194 2350           */
2195 2351          spa_evicting_os_wait(spa);
2196 2352          spa->spa_minref = refcount_count(&spa->spa_refcount);
2197 2353          if (error) {
2198 2354                  if (error != EEXIST) {
2199 2355                          spa->spa_loaded_ts.tv_sec = 0;
2200 2356                          spa->spa_loaded_ts.tv_nsec = 0;
2201 2357                  }
2202 2358                  if (error != EBADF) {
2203 2359                          zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
2204 2360                  }
2205 2361          }
2206 2362          spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
2207 2363          spa->spa_ena = 0;
2208      -
2209 2364          return (error);
2210 2365  }
2211 2366  
2212 2367  /*
2213 2368   * Count the number of per-vdev ZAPs associated with all of the vdevs in the
2214 2369   * vdev tree rooted in the given vd, and ensure that each ZAP is present in the
2215 2370   * spa's per-vdev ZAP list.
2216 2371   */
2217 2372  static uint64_t
2218 2373  vdev_count_verify_zaps(vdev_t *vd)
... 11 lines elided ...
2230 2385                      spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
2231 2386          }
2232 2387  
2233 2388          for (uint64_t i = 0; i < vd->vdev_children; i++) {
2234 2389                  total += vdev_count_verify_zaps(vd->vdev_child[i]);
2235 2390          }
2236 2391  
2237 2392          return (total);
2238 2393  }
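
vdev_count_verify_zaps() is a straightforward recursive accumulation over the vdev tree: count this vdev's two ZAP slots, then recurse into every child. The same shape on a toy node type:

    #include <stdint.h>

    /* Toy node mirroring the two per-vdev ZAP slots walked above. */
    typedef struct node {
            uint64_t top_zap;       /* 0 means "no ZAP" */
            uint64_t leaf_zap;
            struct node **child;
            uint64_t children;
    } node_t;

    static uint64_t
    count_zaps(const node_t *n)
    {
            uint64_t total = 0;

            if (n->top_zap != 0)
                    total++;
            if (n->leaf_zap != 0)
                    total++;
            for (uint64_t i = 0; i < n->children; i++)
                    total += count_zaps(n->child[i]);
            return (total);
    }
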
2239 2394  
     2395 +/*
     2396 + * Load an existing storage pool, using the pool's builtin spa_config as a
     2397 + * source of configuration information.
     2398 + */
2240 2399  static int
2241      -spa_verify_host(spa_t *spa, nvlist_t *mos_config)
     2400 +spa_load_impl(spa_t *spa, uint64_t pool_guid, nvlist_t *config,
     2401 +    spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
     2402 +    char **ereport)
2242 2403  {
2243      -        uint64_t hostid;
2244      -        char *hostname;
2245      -        uint64_t myhostid = 0;
2246      -
2247      -        if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config,
2248      -            ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2249      -                hostname = fnvlist_lookup_string(mos_config,
2250      -                    ZPOOL_CONFIG_HOSTNAME);
2251      -
2252      -                myhostid = zone_get_hostid(NULL);
2253      -
2254      -                if (hostid != 0 && myhostid != 0 && hostid != myhostid) {
2255      -                        cmn_err(CE_WARN, "pool '%s' could not be "
2256      -                            "loaded as it was last accessed by "
2257      -                            "another system (host: %s hostid: 0x%llx). "
2258      -                            "See: http://illumos.org/msg/ZFS-8000-EY",
2259      -                            spa_name(spa), hostname, (u_longlong_t)hostid);
2260      -                        spa_load_failed(spa, "hostid verification failed: pool "
2261      -                            "last accessed by host: %s (hostid: 0x%llx)",
2262      -                            hostname, (u_longlong_t)hostid);
2263      -                        return (SET_ERROR(EBADF));
2264      -                }
2265      -        }
2266      -
2267      -        return (0);
2268      -}
2269      -
2270      -static int
2271      -spa_ld_parse_config(spa_t *spa, spa_import_type_t type)
2272      -{
2273 2404          int error = 0;
2274      -        nvlist_t *nvtree, *nvl, *config = spa->spa_config;
2275      -        int parse;
     2405 +        nvlist_t *nvroot = NULL;
     2406 +        nvlist_t *label;
2276 2407          vdev_t *rvd;
2277      -        uint64_t pool_guid;
2278      -        char *comment;
     2408 +        uberblock_t *ub = &spa->spa_uberblock;
     2409 +        uint64_t children, config_cache_txg = spa->spa_config_txg;
     2410 +        int orig_mode = spa->spa_mode;
     2411 +        int parse;
     2412 +        uint64_t obj;
     2413 +        boolean_t missing_feat_write = B_FALSE;
     2414 +        spa_meta_placement_t *mp;
2279 2415  
2280 2416          /*
2281      -         * Versioning wasn't explicitly added to the label until later, so if
2282      -         * it's not present treat it as the initial version.
     2417 +         * If this is an untrusted config, access the pool in read-only mode.
     2418 +         * This prevents things like resilvering recently removed devices.
2283 2419           */
2284      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2285      -            &spa->spa_ubsync.ub_version) != 0)
2286      -                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
     2420 +        if (!mosconfig)
     2421 +                spa->spa_mode = FREAD;
2287 2422  
2288      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) {
2289      -                spa_load_failed(spa, "invalid config provided: '%s' missing",
2290      -                    ZPOOL_CONFIG_POOL_GUID);
2291      -                return (SET_ERROR(EINVAL));
2292      -        }
     2423 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
2293 2424  
2294      -        if ((spa->spa_load_state == SPA_LOAD_IMPORT || spa->spa_load_state ==
2295      -            SPA_LOAD_TRYIMPORT) && spa_guid_exists(pool_guid, 0)) {
2296      -                spa_load_failed(spa, "a pool with guid %llu is already open",
2297      -                    (u_longlong_t)pool_guid);
2298      -                return (SET_ERROR(EEXIST));
2299      -        }
     2425 +        spa->spa_load_state = state;
2300 2426  
2301      -        spa->spa_config_guid = pool_guid;
2302      -
2303      -        nvlist_free(spa->spa_load_info);
2304      -        spa->spa_load_info = fnvlist_alloc();
2305      -
2306      -        ASSERT(spa->spa_comment == NULL);
2307      -        if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2308      -                spa->spa_comment = spa_strdup(comment);
2309      -
2310      -        (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2311      -            &spa->spa_config_txg);
2312      -
2313      -        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0)
2314      -                spa->spa_config_splitting = fnvlist_dup(nvl);
2315      -
2316      -        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) {
2317      -                spa_load_failed(spa, "invalid config provided: '%s' missing",
2318      -                    ZPOOL_CONFIG_VDEV_TREE);
     2427 +        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot))
2319 2428                  return (SET_ERROR(EINVAL));
2320      -        }
2321 2429  
     2430 +        parse = (type == SPA_IMPORT_EXISTING ?
     2431 +            VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
     2432 +
2322 2433          /*
2323 2434           * Create "The Godfather" zio to hold all async IOs
2324 2435           */
2325 2436          spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
2326 2437              KM_SLEEP);
2327 2438          for (int i = 0; i < max_ncpus; i++) {
2328 2439                  spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
2329 2440                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
2330 2441                      ZIO_FLAG_GODFATHER);
2331 2442          }
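
An aside on the loop just above: one GODFATHER root per CPU lets async children attach to the root for the CPU they are issued from rather than contending on a single parent, and unload then zio_wait()s on each root in turn (see the loop near the top of this section). A sketch of the per-CPU slot pattern only; parent_t is a hypothetical handle, not a zio:

    #include <stdlib.h>

    typedef struct parent { int cpu; } parent_t;

    static parent_t **async_root;
    static int ncpus;

    static void
    async_roots_create(int max_ncpus)
    {
            ncpus = max_ncpus;
            async_root = calloc(ncpus, sizeof (parent_t *));
            for (int i = 0; i < ncpus; i++) {
                    async_root[i] = malloc(sizeof (parent_t));
                    async_root[i]->cpu = i;
            }
    }

    /* Children attach to the root for the CPU they run on, spreading
     * the contention across max_ncpus parents. */
    static parent_t *
    async_root_for(int cpu_id)
    {
            return (async_root[cpu_id % ncpus]);
    }
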
2332 2443  
2333 2444          /*
2334 2445           * Parse the configuration into a vdev tree.  We explicitly set the
2335 2446           * value that will be returned by spa_version() since parsing the
2336 2447           * configuration requires knowing the version number.
2337 2448           */
2338 2449          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2339      -        parse = (type == SPA_IMPORT_EXISTING ?
2340      -            VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2341      -        error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse);
     2450 +        error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, parse);
2342 2451          spa_config_exit(spa, SCL_ALL, FTAG);
2343 2452  
2344      -        if (error != 0) {
2345      -                spa_load_failed(spa, "unable to parse config [error=%d]",
2346      -                    error);
     2453 +        if (error != 0)
2347 2454                  return (error);
2348      -        }
2349 2455  
2350 2456          ASSERT(spa->spa_root_vdev == rvd);
2351 2457          ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
2352 2458          ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
2353 2459  
2354 2460          if (type != SPA_IMPORT_ASSEMBLE) {
2355 2461                  ASSERT(spa_guid(spa) == pool_guid);
2356 2462          }
2357 2463  
2358      -        return (0);
2359      -}
2360      -
2361      -/*
2362      - * Recursively open all vdevs in the vdev tree. This function is called twice:
2363      - * first with the untrusted config, then with the trusted config.
2364      - */
2365      -static int
2366      -spa_ld_open_vdevs(spa_t *spa)
2367      -{
2368      -        int error = 0;
2369      -
2370 2464          /*
2371      -         * spa_missing_tvds_allowed defines how many top-level vdevs can be
2372      -         * missing/unopenable for the root vdev to be still considered openable.
     2465 +         * Try to open all vdevs, loading each label in the process.
2373 2466           */
2374      -        if (spa->spa_trust_config) {
2375      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds;
2376      -        } else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) {
2377      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile;
2378      -        } else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) {
2379      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan;
2380      -        } else {
2381      -                spa->spa_missing_tvds_allowed = 0;
2382      -        }
2383      -
2384      -        spa->spa_missing_tvds_allowed =
2385      -            MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed);
2386      -
2387 2467          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2388      -        error = vdev_open(spa->spa_root_vdev);
     2468 +        error = vdev_open(rvd);
2389 2469          spa_config_exit(spa, SCL_ALL, FTAG);
     2470 +        if (error != 0)
     2471 +                return (error);
2390 2472  
2391      -        if (spa->spa_missing_tvds != 0) {
2392      -                spa_load_note(spa, "vdev tree has %lld missing top-level "
2393      -                    "vdevs.", (u_longlong_t)spa->spa_missing_tvds);
2394      -                if (spa->spa_trust_config && (spa->spa_mode & FWRITE)) {
2395      -                        /*
2396      -                         * Although theoretically we could allow users to open
2397      -                         * incomplete pools in RW mode, we'd need to add a lot
2398      -                         * of extra logic (e.g. adjust pool space to account
2399      -                         * for missing vdevs).
2400      -                         * This limitation also prevents users from accidentally
2401      -                         * opening the pool in RW mode during data recovery and
2402      -                         * damaging it further.
2403      -                         */
2404      -                        spa_load_note(spa, "pools with missing top-level "
2405      -                            "vdevs can only be opened in read-only mode.");
2406      -                        error = SET_ERROR(ENXIO);
2407      -                } else {
2408      -                        spa_load_note(spa, "current settings allow for maximum "
2409      -                            "%lld missing top-level vdevs at this stage.",
2410      -                            (u_longlong_t)spa->spa_missing_tvds_allowed);
2411      -                }
2412      -        }
2413      -        if (error != 0) {
2414      -                spa_load_failed(spa, "unable to open vdev tree [error=%d]",
2415      -                    error);
2416      -        }
2417      -        if (spa->spa_missing_tvds != 0 || error != 0)
2418      -                vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2);
     2473 +        /*
     2474 +         * We need to validate the vdev labels against the configuration that
     2475 +         * we have in hand, which is dependent on the setting of mosconfig. If
     2476 +         * mosconfig is true then we're validating the vdev labels based on
     2477 +         * that config.  Otherwise, we're validating against the cached config
     2478 +         * (zpool.cache) that was read when we loaded the zfs module, and then
     2479 +         * later we will recursively call spa_load() and validate against
     2480 +         * the vdev config.
     2481 +         *
     2482 +         * If we're assembling a new pool that's been split off from an
     2483 +         * existing pool, the labels haven't yet been updated so we skip
     2484 +         * validation for now.
     2485 +         */
     2486 +        if (type != SPA_IMPORT_ASSEMBLE) {
     2487 +                spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     2488 +                error = vdev_validate(rvd, mosconfig);
     2489 +                spa_config_exit(spa, SCL_ALL, FTAG);
2419 2490  
2420      -        return (error);
2421      -}
     2491 +                if (error != 0)
     2492 +                        return (error);
2422 2493  
2423      -/*
2424      - * We need to validate the vdev labels against the configuration that
2425      - * we have in hand. This function is called twice: first with an untrusted
2426      - * config, then with a trusted config. The validation is more strict when the
2427      - * config is trusted.
2428      - */
2429      -static int
2430      -spa_ld_validate_vdevs(spa_t *spa)
2431      -{
2432      -        int error = 0;
2433      -        vdev_t *rvd = spa->spa_root_vdev;
2434      -
2435      -        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2436      -        error = vdev_validate(rvd);
2437      -        spa_config_exit(spa, SCL_ALL, FTAG);
2438      -
2439      -        if (error != 0) {
2440      -                spa_load_failed(spa, "vdev_validate failed [error=%d]", error);
2441      -                return (error);
     2494 +                if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
     2495 +                        return (SET_ERROR(ENXIO));
2442 2496          }
2443 2497  
2444      -        if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) {
2445      -                spa_load_failed(spa, "cannot open vdev tree after invalidating "
2446      -                    "some vdevs");
2447      -                vdev_dbgmsg_print_tree(rvd, 2);
2448      -                return (SET_ERROR(ENXIO));
2449      -        }
2450      -
2451      -        return (0);
2452      -}
2453      -
2454      -static int
2455      -spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type)
2456      -{
2457      -        vdev_t *rvd = spa->spa_root_vdev;
2458      -        nvlist_t *label;
2459      -        uberblock_t *ub = &spa->spa_uberblock;
2460      -
2461 2498          /*
2462 2499           * Find the best uberblock.
2463 2500           */
2464 2501          vdev_uberblock_load(rvd, ub, &label);
2465 2502  
2466 2503          /*
2467 2504           * If we weren't able to find a single valid uberblock, return failure.
2468 2505           */
2469 2506          if (ub->ub_txg == 0) {
2470 2507                  nvlist_free(label);
2471      -                spa_load_failed(spa, "no valid uberblock found");
2472 2508                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
2473 2509          }
2474 2510  
2475      -        spa_load_note(spa, "using uberblock with txg=%llu",
2476      -            (u_longlong_t)ub->ub_txg);
2477      -
2478 2511          /*
2479 2512           * If the pool has an unsupported version we can't open it.
2480 2513           */
2481 2514          if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
2482 2515                  nvlist_free(label);
2483      -                spa_load_failed(spa, "version %llu is not supported",
2484      -                    (u_longlong_t)ub->ub_version);
2485 2516                  return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
2486 2517          }
2487 2518  
2488 2519          if (ub->ub_version >= SPA_VERSION_FEATURES) {
2489 2520                  nvlist_t *features;
2490 2521  
2491 2522                  /*
2492 2523                   * If we weren't able to find what's necessary for reading the
2493 2524                   * MOS in the label, return failure.
2494 2525                   */
2495      -                if (label == NULL) {
2496      -                        spa_load_failed(spa, "label config unavailable");
2497      -                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2498      -                            ENXIO));
2499      -                }
2500      -
2501      -                if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ,
2502      -                    &features) != 0) {
     2526 +                if (label == NULL || nvlist_lookup_nvlist(label,
     2527 +                    ZPOOL_CONFIG_FEATURES_FOR_READ, &features) != 0) {
2503 2528                          nvlist_free(label);
2504      -                        spa_load_failed(spa, "invalid label: '%s' missing",
2505      -                            ZPOOL_CONFIG_FEATURES_FOR_READ);
2506 2529                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2507 2530                              ENXIO));
2508 2531                  }
2509 2532  
2510 2533                  /*
2511 2534                   * Update our in-core representation with the definitive values
2512 2535                   * from the label.
2513 2536                   */
2514 2537                  nvlist_free(spa->spa_label_features);
2515 2538                  VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0);
... 18 lines elided ...
2534 2557                          if (!zfeature_is_supported(nvpair_name(nvp))) {
2535 2558                                  VERIFY(nvlist_add_string(unsup_feat,
2536 2559                                      nvpair_name(nvp), "") == 0);
2537 2560                          }
2538 2561                  }
2539 2562  
2540 2563                  if (!nvlist_empty(unsup_feat)) {
2541 2564                          VERIFY(nvlist_add_nvlist(spa->spa_load_info,
2542 2565                              ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0);
2543 2566                          nvlist_free(unsup_feat);
2544      -                        spa_load_failed(spa, "some features are unsupported");
2545 2567                          return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2546 2568                              ENOTSUP));
2547 2569                  }
2548 2570  
2549 2571                  nvlist_free(unsup_feat);
2550 2572          }
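
The label-side feature check above collects the names of unsupported features into an nvlist; spa_features_check() does the MOS-side equivalent. A userland sketch of that walk (libnvpair); feature_supported() and its table are stand-ins for zfeature_is_supported():

    #include <assert.h>
    #include <string.h>
    #include <libnvpair.h>

    static int
    feature_supported(const char *name)
    {
            return (strcmp(name, "com.example:unknown") != 0);
    }

    /* Walk a features-for-read style nvlist; unsupported feature names
     * are copied into unsup, mirroring the label check above. */
    static void
    check_features(nvlist_t *features, nvlist_t *unsup)
    {
            for (nvpair_t *nvp = nvlist_next_nvpair(features, NULL);
                nvp != NULL; nvp = nvlist_next_nvpair(features, nvp)) {
                    if (!feature_supported(nvpair_name(nvp)))
                            assert(nvlist_add_string(unsup,
                                nvpair_name(nvp), "") == 0);
            }
    }
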
2551 2573  
     2574 +        /*
     2575 +         * If the vdev guid sum doesn't match the uberblock, we have an
     2576 +         * incomplete configuration.  We first check to see if the pool
     2577 +         * is aware of the complete config (i.e. ZPOOL_CONFIG_VDEV_CHILDREN).
     2578 +         * If it is, defer the vdev_guid_sum check till later so we
     2579 +         * can handle missing vdevs.
     2580 +         */
     2581 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
     2582 +            &children) != 0 && mosconfig && type != SPA_IMPORT_ASSEMBLE &&
     2583 +            rvd->vdev_guid_sum != ub->ub_guid_sum)
     2584 +                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
     2585 +
2552 2586          if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
2553 2587                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2554      -                spa_try_repair(spa, spa->spa_config);
     2588 +                spa_try_repair(spa, config);
2555 2589                  spa_config_exit(spa, SCL_ALL, FTAG);
2556 2590                  nvlist_free(spa->spa_config_splitting);
2557 2591                  spa->spa_config_splitting = NULL;
2558 2592          }
2559 2593  
2560 2594          /*
2561 2595           * Initialize internal SPA structures.
2562 2596           */
2563 2597          spa->spa_state = POOL_STATE_ACTIVE;
2564 2598          spa->spa_ubsync = spa->spa_uberblock;
2565 2599          spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
2566 2600              TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
2567 2601          spa->spa_first_txg = spa->spa_last_ubsync_txg ?
2568 2602              spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
2569 2603          spa->spa_claim_max_txg = spa->spa_first_txg;
2570 2604          spa->spa_prev_software_version = ub->ub_software_version;
2571 2605  
2572      -        return (0);
2573      -}
2574      -
2575      -static int
2576      -spa_ld_open_rootbp(spa_t *spa)
2577      -{
2578      -        int error = 0;
2579      -        vdev_t *rvd = spa->spa_root_vdev;
2580      -
2581 2606          error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
2582      -        if (error != 0) {
2583      -                spa_load_failed(spa, "unable to open rootbp in dsl_pool_init "
2584      -                    "[error=%d]", error);
     2607 +        if (error)
2585 2608                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2586      -        }
2587 2609          spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
2588 2610  
2589      -        return (0);
2590      -}
2591      -
2592      -static int
2593      -spa_ld_load_trusted_config(spa_t *spa, spa_import_type_t type,
2594      -    boolean_t reloading)
2595      -{
2596      -        vdev_t *mrvd, *rvd = spa->spa_root_vdev;
2597      -        nvlist_t *nv, *mos_config, *policy;
2598      -        int error = 0, copy_error;
2599      -        uint64_t healthy_tvds, healthy_tvds_mos;
2600      -        uint64_t mos_config_txg;
2601      -
2602      -        if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE)
2603      -            != 0)
     2611 +        if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object) != 0)
2604 2612                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2605 2613  
2606      -        /*
2607      -         * If we're assembling a pool from a split, the config provided is
2608      -         * already trusted so there is nothing to do.
2609      -         */
2610      -        if (type == SPA_IMPORT_ASSEMBLE)
2611      -                return (0);
2612      -
2613      -        healthy_tvds = spa_healthy_core_tvds(spa);
2614      -
2615      -        if (load_nvlist(spa, spa->spa_config_object, &mos_config)
2616      -            != 0) {
2617      -                spa_load_failed(spa, "unable to retrieve MOS config");
2618      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2619      -        }
2620      -
2621      -        /*
2622      -         * If we are doing an open, pool owner wasn't verified yet, thus do
2623      -         * the verification here.
2624      -         */
2625      -        if (spa->spa_load_state == SPA_LOAD_OPEN) {
2626      -                error = spa_verify_host(spa, mos_config);
2627      -                if (error != 0) {
2628      -                        nvlist_free(mos_config);
2629      -                        return (error);
2630      -                }
2631      -        }
2632      -
2633      -        nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE);
2634      -
2635      -        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2636      -
2637      -        /*
2638      -         * Build a new vdev tree from the trusted config
2639      -         */
2640      -        VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
2641      -
2642      -        /*
2643      -         * Vdev paths in the MOS may be obsolete. If the untrusted config was
2644      -         * obtained by scanning /dev/dsk, then it will have the right vdev
2645      -         * paths. We update the trusted MOS config with this information.
2646      -         * We first try to copy the paths with vdev_copy_path_strict, which
2647      -         * succeeds only when both configs have exactly the same vdev tree.
2648      -         * If that fails, we fall back to a more flexible method that has a
2649      -         * best effort policy.
2650      -         */
2651      -        copy_error = vdev_copy_path_strict(rvd, mrvd);
2652      -        if (copy_error != 0 || spa_load_print_vdev_tree) {
2653      -                spa_load_note(spa, "provided vdev tree:");
2654      -                vdev_dbgmsg_print_tree(rvd, 2);
2655      -                spa_load_note(spa, "MOS vdev tree:");
2656      -                vdev_dbgmsg_print_tree(mrvd, 2);
2657      -        }
2658      -        if (copy_error != 0) {
2659      -                spa_load_note(spa, "vdev_copy_path_strict failed, falling "
2660      -                    "back to vdev_copy_path_relaxed");
2661      -                vdev_copy_path_relaxed(rvd, mrvd);
2662      -        }
2663      -
2664      -        vdev_close(rvd);
2665      -        vdev_free(rvd);
2666      -        spa->spa_root_vdev = mrvd;
2667      -        rvd = mrvd;
2668      -        spa_config_exit(spa, SCL_ALL, FTAG);
2669      -
2670      -        /*
2671      -         * We will use spa_config if we decide to reload the spa or if spa_load
2672      -         * fails and we rewind. We must thus regenerate the config using the
2673      -         * MOS information with the updated paths. Rewind policy is an import
2674      -         * setting and is not in the MOS. We copy it over to our new, trusted
2675      -         * config.
2676      -         */
2677      -        mos_config_txg = fnvlist_lookup_uint64(mos_config,
2678      -            ZPOOL_CONFIG_POOL_TXG);
2679      -        nvlist_free(mos_config);
2680      -        mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE);
2681      -        if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_REWIND_POLICY,
2682      -            &policy) == 0)
2683      -                fnvlist_add_nvlist(mos_config, ZPOOL_REWIND_POLICY, policy);
2684      -        spa_config_set(spa, mos_config);
2685      -        spa->spa_config_source = SPA_CONFIG_SRC_MOS;
2686      -
2687      -        /*
2688      -         * Now that we got the config from the MOS, we should be more strict
2689      -         * in checking blkptrs and can make assumptions about the consistency
2690      -         * of the vdev tree. spa_trust_config must be set to true before opening
2691      -         * vdevs in order for them to be writeable.
2692      -         */
2693      -        spa->spa_trust_config = B_TRUE;
2694      -
2695      -        /*
2696      -         * Open and validate the new vdev tree
2697      -         */
2698      -        error = spa_ld_open_vdevs(spa);
2699      -        if (error != 0)
2700      -                return (error);
2701      -
2702      -        error = spa_ld_validate_vdevs(spa);
2703      -        if (error != 0)
2704      -                return (error);
2705      -
2706      -        if (copy_error != 0 || spa_load_print_vdev_tree) {
2707      -                spa_load_note(spa, "final vdev tree:");
2708      -                vdev_dbgmsg_print_tree(rvd, 2);
2709      -        }
2710      -
2711      -        if (spa->spa_load_state != SPA_LOAD_TRYIMPORT &&
2712      -            !spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) {
2713      -                /*
2714      -                 * Sanity check to make sure that we are indeed loading the
2715      -                 * latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds
2716      -                 * in the config provided and they happened to be the only ones
2717      -                 * to have the latest uberblock, we could involuntarily perform
2718      -                 * an extreme rewind.
2719      -                 */
2720      -                healthy_tvds_mos = spa_healthy_core_tvds(spa);
2721      -                if (healthy_tvds_mos - healthy_tvds >=
2722      -                    SPA_SYNC_MIN_VDEVS) {
2723      -                        spa_load_note(spa, "config provided misses too many "
2724      -                            "top-level vdevs compared to MOS (%lld vs %lld). ",
2725      -                            (u_longlong_t)healthy_tvds,
2726      -                            (u_longlong_t)healthy_tvds_mos);
2727      -                        spa_load_note(spa, "vdev tree:");
2728      -                        vdev_dbgmsg_print_tree(rvd, 2);
2729      -                        if (reloading) {
2730      -                                spa_load_failed(spa, "config was already "
2731      -                                    "provided from MOS. Aborting.");
2732      -                                return (spa_vdev_err(rvd,
2733      -                                    VDEV_AUX_CORRUPT_DATA, EIO));
2734      -                        }
2735      -                        spa_load_note(spa, "spa must be reloaded using MOS "
2736      -                            "config");
2737      -                        return (SET_ERROR(EAGAIN));
2738      -                }
2739      -        }
2740      -
2741      -        error = spa_check_for_missing_logs(spa);
2742      -        if (error != 0)
2743      -                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2744      -
2745      -        if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) {
2746      -                spa_load_failed(spa, "uberblock guid sum doesn't match MOS "
2747      -                    "guid sum (%llu != %llu)",
2748      -                    (u_longlong_t)spa->spa_uberblock.ub_guid_sum,
2749      -                    (u_longlong_t)rvd->vdev_guid_sum);
2750      -                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
2751      -                    ENXIO));
2752      -        }
2753      -
2754      -        return (0);
2755      -}
2756      -
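The guid-sum comparison above is easier to see with a toy model: a mismatch means the config we were given does not describe the same set of vdevs that wrote the uberblock. The sketch below is user-level C with a hypothetical, pared-down vdev struct (the kernel maintains the sum incrementally rather than re-walking the tree); it shows what is being checked against ub_guid_sum:

#include <stdint.h>

/* Hypothetical, pared-down vdev: a guid plus children. */
typedef struct toy_vdev {
        uint64_t        tv_guid;
        struct toy_vdev **tv_child;
        int             tv_children;
} toy_vdev_t;

/*
 * Sum every guid in the subtree.  Unsigned overflow wraps modulo 2^64,
 * matching how the sum is carried in the uberblock.
 */
static uint64_t
toy_guid_sum(const toy_vdev_t *vd)
{
        uint64_t sum = vd->tv_guid;

        for (int c = 0; c < vd->tv_children; c++)
                sum += toy_guid_sum(vd->tv_child[c]);
        return (sum);
}
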
2757      -static int
2758      -spa_ld_open_indirect_vdev_metadata(spa_t *spa)
2759      -{
2760      -        int error = 0;
2761      -        vdev_t *rvd = spa->spa_root_vdev;
2762      -
2763      -        /*
2764      -         * Everything that we read before spa_remove_init() must be stored
2765      -         * on concrete vdevs.  Therefore we do this as early as possible.
2766      -         */
2767      -        error = spa_remove_init(spa);
2768      -        if (error != 0) {
2769      -                spa_load_failed(spa, "spa_remove_init failed [error=%d]",
2770      -                    error);
2771      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2772      -        }
2773      -
2774      -        /*
2775      -         * Retrieve information needed to condense indirect vdev mappings.
2776      -         */
2777      -        error = spa_condense_init(spa);
2778      -        if (error != 0) {
2779      -                spa_load_failed(spa, "spa_condense_init failed [error=%d]",
2780      -                    error);
2781      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
2782      -        }
2783      -
2784      -        return (0);
2785      -}
2786      -
2787      -static int
2788      -spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep)
2789      -{
2790      -        int error = 0;
2791      -        vdev_t *rvd = spa->spa_root_vdev;
2792      -
2793 2614          if (spa_version(spa) >= SPA_VERSION_FEATURES) {
2794 2615                  boolean_t missing_feat_read = B_FALSE;
2795 2616                  nvlist_t *unsup_feat, *enabled_feat;
2796 2617  
2797 2618                  if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
2798      -                    &spa->spa_feat_for_read_obj, B_TRUE) != 0) {
     2619 +                    &spa->spa_feat_for_read_obj) != 0) {
2799 2620                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2800 2621                  }
2801 2622  
2802 2623                  if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
2803      -                    &spa->spa_feat_for_write_obj, B_TRUE) != 0) {
     2624 +                    &spa->spa_feat_for_write_obj) != 0) {
2804 2625                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2805 2626                  }
2806 2627  
2807 2628                  if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
2808      -                    &spa->spa_feat_desc_obj, B_TRUE) != 0) {
     2629 +                    &spa->spa_feat_desc_obj) != 0) {
2809 2630                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2810 2631                  }
2811 2632  
2812 2633                  enabled_feat = fnvlist_alloc();
2813 2634                  unsup_feat = fnvlist_alloc();
2814 2635  
2815 2636                  if (!spa_features_check(spa, B_FALSE,
2816 2637                      unsup_feat, enabled_feat))
2817 2638                          missing_feat_read = B_TRUE;
2818 2639  
2819      -                if (spa_writeable(spa) ||
2820      -                    spa->spa_load_state == SPA_LOAD_TRYIMPORT) {
     2640 +                if (spa_writeable(spa) || state == SPA_LOAD_TRYIMPORT) {
2821 2641                          if (!spa_features_check(spa, B_TRUE,
2822 2642                              unsup_feat, enabled_feat)) {
2823      -                                *missing_feat_writep = B_TRUE;
     2643 +                                missing_feat_write = B_TRUE;
2824 2644                          }
2825 2645                  }
2826 2646  
2827 2647                  fnvlist_add_nvlist(spa->spa_load_info,
2828 2648                      ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
2829 2649  
2830 2650                  if (!nvlist_empty(unsup_feat)) {
2831 2651                          fnvlist_add_nvlist(spa->spa_load_info,
2832 2652                              ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
2833 2653                  }
[ 18 lines elided ]
2852 2672                   * mode but not read-write mode, it is displayed as unavailable
2853 2673                   * in userland with a special note that the pool is actually
2854 2674                   * available for open in read-only mode.
2855 2675                   *
2856 2676                   * As a result, if the state is SPA_LOAD_TRYIMPORT and we are
2857 2677                   * missing a feature for write, we must first determine whether
2858 2678                   * the pool can be opened read-only before returning to
2859 2679                   * userland in order to know whether to display the
2860 2680                   * abovementioned note.
2861 2681                   */
2862      -                if (missing_feat_read || (*missing_feat_writep &&
     2682 +                if (missing_feat_read || (missing_feat_write &&
2863 2683                      spa_writeable(spa))) {
2864      -                        spa_load_failed(spa, "pool uses unsupported features");
2865 2684                          return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2866 2685                              ENOTSUP));
2867 2686                  }
2868 2687  
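Taken together, missing_feat_read and missing_feat_write drive a three-way outcome: unreadable, read-only, or fully writable. A minimal stand-alone model of that gate (the names here are hypothetical; the real flags come from spa_features_check() walking the MOS feature ZAPs):

#include <stdbool.h>

typedef enum toy_open_mode {
        TOY_OPEN_DENIED,        /* missing a feature needed to read */
        TOY_OPEN_READONLY,      /* readable, but missing a write feature */
        TOY_OPEN_READWRITE      /* all active features supported */
} toy_open_mode_t;

static toy_open_mode_t
toy_feature_gate(bool missing_feat_read, bool missing_feat_write)
{
        if (missing_feat_read)
                return (TOY_OPEN_DENIED);
        if (missing_feat_write)
                return (TOY_OPEN_READONLY);
        return (TOY_OPEN_READWRITE);
}
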
2869 2688                  /*
2870 2689                   * Load refcounts for ZFS features from disk into an in-memory
2871 2690                   * cache during SPA initialization.
2872 2691                   */
2873 2692                  for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
2874 2693                          uint64_t refcount;
2875 2694  
2876 2695                          error = feature_get_refcount_from_disk(spa,
2877 2696                              &spa_feature_table[i], &refcount);
2878 2697                          if (error == 0) {
2879 2698                                  spa->spa_feat_refcount_cache[i] = refcount;
2880 2699                          } else if (error == ENOTSUP) {
2881 2700                                  spa->spa_feat_refcount_cache[i] =
2882 2701                                      SPA_FEATURE_DISABLED;
2883 2702                          } else {
2884      -                                spa_load_failed(spa, "error getting refcount "
2885      -                                    "for feature %s [error=%d]",
2886      -                                    spa_feature_table[i].fi_guid, error);
2887 2703                                  return (spa_vdev_err(rvd,
2888 2704                                      VDEV_AUX_CORRUPT_DATA, EIO));
2889 2705                          }
2890 2706                  }
2891 2707          }
2892 2708  
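The refcount loop just above tolerates exactly one error: ENOTSUP means the feature has no refcount object on disk (it was never enabled) and is cached as disabled, while anything else fails the load as corruption. A self-contained model, with a stubbed read standing in for feature_get_refcount_from_disk() and a made-up sentinel:

#include <errno.h>
#include <stdint.h>

#define TOY_FEATURE_DISABLED    UINT64_MAX      /* stand-in sentinel */

/* Stubbed on-disk read: pretend no feature has a refcount object yet. */
static int
toy_read_refcount(int feature, uint64_t *refcount)
{
        (void) feature;
        *refcount = 0;
        return (ENOTSUP);
}

static int
toy_fill_refcount_cache(uint64_t *cache, int nfeatures)
{
        for (int i = 0; i < nfeatures; i++) {
                uint64_t refcount;
                int error = toy_read_refcount(i, &refcount);

                if (error == 0)
                        cache[i] = refcount;
                else if (error == ENOTSUP)
                        cache[i] = TOY_FEATURE_DISABLED;
                else
                        return (error);         /* corruption: fail the load */
        }
        return (0);
}
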
2893 2709          if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
2894 2710                  if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
2895      -                    &spa->spa_feat_enabled_txg_obj, B_TRUE) != 0)
     2711 +                    &spa->spa_feat_enabled_txg_obj) != 0)
2896 2712                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2897 2713          }
2898 2714  
2899      -        return (0);
2900      -}
2901      -
2902      -static int
2903      -spa_ld_load_special_directories(spa_t *spa)
2904      -{
2905      -        int error = 0;
2906      -        vdev_t *rvd = spa->spa_root_vdev;
2907      -
2908 2715          spa->spa_is_initializing = B_TRUE;
2909 2716          error = dsl_pool_open(spa->spa_dsl_pool);
2910 2717          spa->spa_is_initializing = B_FALSE;
2911      -        if (error != 0) {
2912      -                spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error);
     2718 +        if (error != 0)
2913 2719                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2914      -        }
2915 2720  
2916      -        return (0);
2917      -}
     2721 +        if (!mosconfig) {
     2722 +                uint64_t hostid;
     2723 +                nvlist_t *policy = NULL, *nvconfig;
2918 2724  
2919      -static int
2920      -spa_ld_get_props(spa_t *spa)
2921      -{
2922      -        int error = 0;
2923      -        uint64_t obj;
2924      -        vdev_t *rvd = spa->spa_root_vdev;
     2725 +                if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
     2726 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2925 2727  
     2728 +                if (!spa_is_root(spa) && nvlist_lookup_uint64(nvconfig,
     2729 +                    ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
     2730 +                        char *hostname;
     2731 +                        unsigned long myhostid = 0;
     2732 +
     2733 +                        VERIFY(nvlist_lookup_string(nvconfig,
     2734 +                            ZPOOL_CONFIG_HOSTNAME, &hostname) == 0);
     2735 +
     2736 +#ifdef  _KERNEL
     2737 +                        myhostid = zone_get_hostid(NULL);
     2738 +#else   /* _KERNEL */
     2739 +                        /*
     2740 +                         * We're emulating the system's hostid in userland, so
     2741 +                         * we can't use zone_get_hostid().
     2742 +                         */
     2743 +                        (void) ddi_strtoul(hw_serial, NULL, 10, &myhostid);
     2744 +#endif  /* _KERNEL */
     2745 +                        if (hostid != 0 && myhostid != 0 &&
     2746 +                            hostid != myhostid) {
     2747 +                                nvlist_free(nvconfig);
     2748 +                                cmn_err(CE_WARN, "pool '%s' could not be "
     2749 +                                    "loaded as it was last accessed by "
     2750 +                                    "another system (host: %s hostid: 0x%lx). "
     2751 +                                    "See: http://illumos.org/msg/ZFS-8000-EY",
     2752 +                                    spa_name(spa), hostname,
     2753 +                                    (unsigned long)hostid);
     2754 +                                return (SET_ERROR(EBADF));
     2755 +                        }
     2756 +                }
     2757 +                if (nvlist_lookup_nvlist(spa->spa_config,
     2758 +                    ZPOOL_REWIND_POLICY, &policy) == 0)
     2759 +                        VERIFY(nvlist_add_nvlist(nvconfig,
     2760 +                            ZPOOL_REWIND_POLICY, policy) == 0);
     2761 +
     2762 +                spa_config_set(spa, nvconfig);
     2763 +                spa_unload(spa);
     2764 +                spa_deactivate(spa);
     2765 +                spa_activate(spa, orig_mode);
     2766 +
     2767 +                return (spa_load(spa, state, SPA_IMPORT_EXISTING, B_TRUE));
     2768 +        }
     2769 +
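The hostid comparison above is the guard that keeps an untrusted config from importing a pool last touched by a different machine: both ids must be known (nonzero) and differ for the import to fail with EBADF. A user-level model with hypothetical names (in the kernel the local id comes from zone_get_hostid(); userland parses hw_serial instead):

#include <errno.h>
#include <stdio.h>

static int
toy_hostid_check(unsigned long pool_hostid, unsigned long my_hostid,
    const char *pool, const char *last_host)
{
        /* Zero on either side means "unknown", which never blocks. */
        if (pool_hostid != 0 && my_hostid != 0 && pool_hostid != my_hostid) {
                (void) fprintf(stderr, "pool '%s' was last accessed by "
                    "%s (hostid 0x%lx)\n", pool, last_host, pool_hostid);
                return (EBADF);
        }
        return (0);
}
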
2926 2770          /* Grab the secret checksum salt from the MOS. */
2927 2771          error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2928 2772              DMU_POOL_CHECKSUM_SALT, 1,
2929 2773              sizeof (spa->spa_cksum_salt.zcs_bytes),
2930 2774              spa->spa_cksum_salt.zcs_bytes);
2931 2775          if (error == ENOENT) {
2932 2776                  /* Generate a new salt for subsequent use */
2933 2777                  (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
2934 2778                      sizeof (spa->spa_cksum_salt.zcs_bytes));
2935 2779          } else if (error != 0) {
2936      -                spa_load_failed(spa, "unable to retrieve checksum salt from "
2937      -                    "MOS [error=%d]", error);
2938 2780                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2939 2781          }
2940 2782  
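The ENOENT branch above is a lookup-or-generate idiom: pools that predate salted checksums simply have no DMU_POOL_CHECKSUM_SALT attribute, so a fresh salt is created on the spot instead of failing the load. The same idiom in a self-contained sketch (the stub lookup and rand()-based filler are stand-ins; the kernel uses random_get_pseudo_bytes()):

#include <errno.h>
#include <stdlib.h>

#define TOY_SALT_LEN    32

/* Stub MOS lookup: pretend the attribute was never written. */
static int
toy_lookup_salt(unsigned char *buf, int len)
{
        (void) buf;
        (void) len;
        return (ENOENT);
}

static int
toy_get_salt(unsigned char salt[TOY_SALT_LEN])
{
        int error = toy_lookup_salt(salt, TOY_SALT_LEN);

        if (error == ENOENT) {
                /* Older pool: generate a new salt for subsequent use. */
                for (int i = 0; i < TOY_SALT_LEN; i++)
                        salt[i] = (unsigned char)(rand() & 0xff);
                return (0);
        }
        return (error);         /* 0 on success; anything else is fatal */
}
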
2941      -        if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0)
     2783 +        if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj) != 0)
2942 2784                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2943 2785          error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
2944      -        if (error != 0) {
2945      -                spa_load_failed(spa, "error opening deferred-frees bpobj "
2946      -                    "[error=%d]", error);
     2786 +        if (error != 0)
2947 2787                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2948      -        }
2949 2788  
2950 2789          /*
2951 2790           * Load the bit that tells us to use the new accounting function
2952 2791           * (raid-z deflation).  If we have an older pool, this will not
2953 2792           * be present.
2954 2793           */
2955      -        error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE);
     2794 +        error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate);
2956 2795          if (error != 0 && error != ENOENT)
2957 2796                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2958 2797  
2959 2798          error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
2960      -            &spa->spa_creation_version, B_FALSE);
     2799 +            &spa->spa_creation_version);
2961 2800          if (error != 0 && error != ENOENT)
2962 2801                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2963 2802  
2964 2803          /*
2965 2804           * Load the persistent error log.  If we have an older pool, this will
2966 2805           * not be present.
2967 2806           */
2968      -        error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last,
2969      -            B_FALSE);
     2807 +        error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last);
2970 2808          if (error != 0 && error != ENOENT)
2971 2809                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2972 2810  
2973 2811          error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
2974      -            &spa->spa_errlog_scrub, B_FALSE);
     2812 +            &spa->spa_errlog_scrub);
2975 2813          if (error != 0 && error != ENOENT)
2976 2814                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2977 2815  
2978 2816          /*
2979 2817           * Load the history object.  If we have an older pool, this
2980 2818           * will not be present.
2981 2819           */
2982      -        error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE);
     2820 +        error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history);
2983 2821          if (error != 0 && error != ENOENT)
2984 2822                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2985 2823  
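spa_dir_prop() itself is outside this hunk, but every call in this stretch shares one contract: fetch a single uint64_t attribute from the MOS pool directory object, with ENOENT meaning the attribute was never written (an older pool) and therefore tolerated at most call sites. A kernel-context sketch consistent with how it is used here (zap_lookup() does the actual read):

static int
spa_dir_prop(spa_t *spa, const char *name, uint64_t *val)
{
        /* One uint64_t attribute out of the pool directory object. */
        return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            name, sizeof (uint64_t), 1, val));
}
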
2986 2824          /*
2987 2825           * Load the per-vdev ZAP map. If we have an older pool, this will not
2988 2826           * be present; in this case, defer its creation to a later time to
2989 2827           * avoid dirtying the MOS this early / out of sync context. See
2990 2828           * spa_sync_config_object.
2991 2829           */
2992 2830  
2993 2831          /* The sentinel is only available in the MOS config. */
2994 2832          nvlist_t *mos_config;
2995      -        if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) {
2996      -                spa_load_failed(spa, "unable to retrieve MOS config");
     2833 +        if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0)
2997 2834                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2998      -        }
2999 2835  
3000 2836          error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
3001      -            &spa->spa_all_vdev_zaps, B_FALSE);
     2837 +            &spa->spa_all_vdev_zaps);
3002 2838  
3003 2839          if (error == ENOENT) {
3004 2840                  VERIFY(!nvlist_exists(mos_config,
3005 2841                      ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
3006 2842                  spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
3007 2843                  ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3008 2844          } else if (error != 0) {
3009 2845                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3010 2846          } else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
3011 2847                  /*
[ 3 lines elided ]
3015 2851                   */
3016 2852                  spa->spa_avz_action = AVZ_ACTION_DESTROY;
3017 2853                  /*
3018 2854                   * We're assuming that no vdevs have had their ZAPs created
3019 2855                   * before this. Better be sure of it.
3020 2856                   */
3021 2857                  ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3022 2858          }
3023 2859          nvlist_free(mos_config);
3024 2860  
3025      -        spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
3026      -
3027      -        error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object,
3028      -            B_FALSE);
3029      -        if (error && error != ENOENT)
3030      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3031      -
3032      -        if (error == 0) {
3033      -                uint64_t autoreplace;
3034      -
3035      -                spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
3036      -                spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
3037      -                spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
3038      -                spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
3039      -                spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
3040      -                spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
3041      -                    &spa->spa_dedup_ditto);
3042      -
3043      -                spa->spa_autoreplace = (autoreplace != 0);
3044      -        }
3045      -
3046 2861          /*
3047      -         * If we are importing a pool with missing top-level vdevs,
3048      -         * we enforce that the pool doesn't panic or get suspended on
3049      -         * error since the likelihood of missing data is extremely high.
3050      -         */
3051      -        if (spa->spa_missing_tvds > 0 &&
3052      -            spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE &&
3053      -            spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3054      -                spa_load_note(spa, "forcing failmode to 'continue' "
3055      -                    "as some top level vdevs are missing");
3056      -                spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE;
3057      -        }
3058      -
3059      -        return (0);
3060      -}
3061      -
3062      -static int
3063      -spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type)
3064      -{
3065      -        int error = 0;
3066      -        vdev_t *rvd = spa->spa_root_vdev;
3067      -
3068      -        /*
3069 2862           * If we're assembling the pool from the split-off vdevs of
3070 2863           * an existing pool, we don't want to attach the spares & cache
3071 2864           * devices.
3072 2865           */
3073 2866  
3074 2867          /*
3075 2868           * Load any hot spares for this pool.
3076 2869           */
3077      -        error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object,
3078      -            B_FALSE);
     2870 +        error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object);
3079 2871          if (error != 0 && error != ENOENT)
3080 2872                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3081 2873          if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3082 2874                  ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
3083 2875                  if (load_nvlist(spa, spa->spa_spares.sav_object,
3084      -                    &spa->spa_spares.sav_config) != 0) {
3085      -                        spa_load_failed(spa, "error loading spares nvlist");
     2876 +                    &spa->spa_spares.sav_config) != 0)
3086 2877                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3087      -                }
3088 2878  
3089 2879                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3090 2880                  spa_load_spares(spa);
3091 2881                  spa_config_exit(spa, SCL_ALL, FTAG);
3092 2882          } else if (error == 0) {
3093 2883                  spa->spa_spares.sav_sync = B_TRUE;
3094 2884          }
3095 2885  
3096 2886          /*
3097 2887           * Load any level 2 ARC devices for this pool.
3098 2888           */
3099 2889          error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
3100      -            &spa->spa_l2cache.sav_object, B_FALSE);
     2890 +            &spa->spa_l2cache.sav_object);
3101 2891          if (error != 0 && error != ENOENT)
3102 2892                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3103 2893          if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3104 2894                  ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
3105 2895                  if (load_nvlist(spa, spa->spa_l2cache.sav_object,
3106      -                    &spa->spa_l2cache.sav_config) != 0) {
3107      -                        spa_load_failed(spa, "error loading l2cache nvlist");
     2896 +                    &spa->spa_l2cache.sav_config) != 0)
3108 2897                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3109      -                }
3110 2898  
3111 2899                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3112 2900                  spa_load_l2cache(spa);
3113 2901                  spa_config_exit(spa, SCL_ALL, FTAG);
3114 2902          } else if (error == 0) {
3115 2903                  spa->spa_l2cache.sav_sync = B_TRUE;
3116 2904          }
3117 2905  
3118      -        return (0);
3119      -}
     2906 +        mp = &spa->spa_meta_policy;
3120 2907  
3121      -static int
3122      -spa_ld_load_vdev_metadata(spa_t *spa)
3123      -{
3124      -        int error = 0;
3125      -        vdev_t *rvd = spa->spa_root_vdev;
     2908 +        spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
     2909 +        spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
     2910 +        spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
     2911 +        spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
     2912 +        spa->spa_dedup_lo_best_effort =
     2913 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
     2914 +        spa->spa_dedup_hi_best_effort =
     2915 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
3126 2916  
     2917 +        mp->spa_enable_meta_placement_selection =
     2918 +            zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
     2919 +        mp->spa_sync_to_special =
     2920 +            zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
     2921 +        mp->spa_ddt_meta_to_special =
     2922 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
     2923 +        mp->spa_zfs_meta_to_special =
     2924 +            zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
     2925 +        mp->spa_small_data_to_special =
     2926 +            zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
     2927 +        spa_set_ddt_classes(spa,
     2928 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_DESEGREGATION));
     2929 +
     2930 +        spa->spa_resilver_prio =
     2931 +            zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
     2932 +        spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
     2933 +
     2934 +        error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object);
     2935 +        if (error && error != ENOENT)
     2936 +                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
     2937 +
     2938 +        if (error == 0) {
     2939 +                uint64_t autoreplace;
     2940 +                uint64_t val = 0;
     2941 +
     2942 +                spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
     2943 +                spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
     2944 +                spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
     2945 +                spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
     2946 +                spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
     2947 +                spa_prop_find(spa, ZPOOL_PROP_BOOTSIZE, &spa->spa_bootsize);
     2948 +                spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
     2949 +                    &spa->spa_dedup_ditto);
     2950 +                spa_prop_find(spa, ZPOOL_PROP_FORCETRIM, &spa->spa_force_trim);
     2951 +
     2952 +                mutex_enter(&spa->spa_auto_trim_lock);
     2953 +                spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_auto_trim);
     2954 +                if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     2955 +                        spa_auto_trim_taskq_create(spa);
     2956 +                mutex_exit(&spa->spa_auto_trim_lock);
     2957 +
     2958 +                spa_prop_find(spa, ZPOOL_PROP_HIWATERMARK, &spa->spa_hiwat);
     2959 +                spa_prop_find(spa, ZPOOL_PROP_LOWATERMARK, &spa->spa_lowat);
     2960 +                spa_prop_find(spa, ZPOOL_PROP_MINWATERMARK, &spa->spa_minwat);
     2961 +                spa_prop_find(spa, ZPOOL_PROP_DEDUPMETA_DITTO,
     2962 +                    &spa->spa_ddt_meta_copies);
     2963 +                spa_prop_find(spa, ZPOOL_PROP_DDT_DESEGREGATION, &val);
     2964 +                spa_set_ddt_classes(spa, val);
     2965 +
     2966 +                spa_prop_find(spa, ZPOOL_PROP_RESILVER_PRIO,
     2967 +                    &spa->spa_resilver_prio);
     2968 +                spa_prop_find(spa, ZPOOL_PROP_SCRUB_PRIO,
     2969 +                    &spa->spa_scrub_prio);
     2970 +
     2971 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_BEST_EFFORT,
     2972 +                    &spa->spa_dedup_best_effort);
     2973 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT,
     2974 +                    &spa->spa_dedup_lo_best_effort);
     2975 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT,
     2976 +                    &spa->spa_dedup_hi_best_effort);
     2977 +
     2978 +                spa_prop_find(spa, ZPOOL_PROP_META_PLACEMENT,
     2979 +                    &mp->spa_enable_meta_placement_selection);
     2980 +                spa_prop_find(spa, ZPOOL_PROP_SYNC_TO_SPECIAL,
     2981 +                    &mp->spa_sync_to_special);
     2982 +                spa_prop_find(spa, ZPOOL_PROP_DDT_META_TO_METADEV,
     2983 +                    &mp->spa_ddt_meta_to_special);
     2984 +                spa_prop_find(spa, ZPOOL_PROP_ZFS_META_TO_METADEV,
     2985 +                    &mp->spa_zfs_meta_to_special);
     2986 +                spa_prop_find(spa, ZPOOL_PROP_SMALL_DATA_TO_METADEV,
     2987 +                    &mp->spa_small_data_to_special);
     2988 +
     2989 +                spa->spa_autoreplace = (autoreplace != 0);
     2990 +        }
     2991 +
     2992 +        error = spa_dir_prop(spa, DMU_POOL_COS_PROPS,
     2993 +            &spa->spa_cos_props_object);
     2994 +        if (error == 0)
     2995 +                (void) spa_load_cos_props(spa);
     2996 +        error = spa_dir_prop(spa, DMU_POOL_VDEV_PROPS,
     2997 +            &spa->spa_vdev_props_object);
     2998 +        if (error == 0)
     2999 +                (void) spa_load_vdev_props(spa);
     3000 +
     3001 +        (void) spa_dir_prop(spa, DMU_POOL_TRIM_START_TIME,
     3002 +            &spa->spa_man_trim_start_time);
     3003 +        (void) spa_dir_prop(spa, DMU_POOL_TRIM_STOP_TIME,
     3004 +            &spa->spa_man_trim_stop_time);
     3005 +
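The block of assignments above, together with the spa_prop_find() calls that follow, applies one pattern repeatedly: seed each tunable from zpool_prop_default_numeric(), then let a persisted value overwrite the default only if the pool-props object exists. A pool created before a given property was invented therefore still loads with sane settings. Distilled into a toy (the names and default value are made up):

#include <stdint.h>

#define TOY_DEFAULT_HIWAT       80      /* assumed default, percent */

typedef struct toy_prop {
        uint64_t        tp_value;
        int             tp_persisted;   /* nonzero if stored in the MOS */
} toy_prop_t;

static void
toy_load_hiwat(uint64_t *hiwat, const toy_prop_t *stored)
{
        *hiwat = TOY_DEFAULT_HIWAT;             /* seed the default... */
        if (stored->tp_persisted)
                *hiwat = stored->tp_value;      /* ...then let disk win */
}
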
3127 3006          /*
3128 3007           * If the 'autoreplace' property is set, then post a resource notifying
3129 3008           * the ZFS DE that it should not issue any faults for unopenable
3130 3009           * devices.  We also iterate over the vdevs, and post a sysevent for any
3131 3010           * unopenable vdevs so that the normal autoreplace handler can take
3132 3011           * over.
3133 3012           */
3134      -        if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
     3013 +        if (spa->spa_autoreplace && state != SPA_LOAD_TRYIMPORT) {
3135 3014                  spa_check_removed(spa->spa_root_vdev);
3136 3015                  /*
3137 3016                   * For the import case, this is done in spa_import(), because
3138 3017                   * at this point we're using the spare definitions from
3139 3018                   * the MOS config, not necessarily from the userland config.
3140 3019                   */
3141      -                if (spa->spa_load_state != SPA_LOAD_IMPORT) {
     3020 +                if (state != SPA_LOAD_IMPORT) {
3142 3021                          spa_aux_check_removed(&spa->spa_spares);
3143 3022                          spa_aux_check_removed(&spa->spa_l2cache);
3144 3023                  }
3145 3024          }
3146 3025  
3147 3026          /*
3148      -         * Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc.
     3027 +         * Load the vdev state for all toplevel vdevs.
3149 3028           */
3150      -        error = vdev_load(rvd);
3151      -        if (error != 0) {
3152      -                spa_load_failed(spa, "vdev_load failed [error=%d]", error);
3153      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
3154      -        }
     3029 +        vdev_load(rvd);
3155 3030  
3156 3031          /*
3157      -         * Propagate the leaf DTLs we just loaded all the way up the vdev tree.
     3032 +         * Propagate the leaf DTLs we just loaded all the way up the tree.
3158 3033           */
3159 3034          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3160 3035          vdev_dtl_reassess(rvd, 0, 0, B_FALSE);
3161 3036          spa_config_exit(spa, SCL_ALL, FTAG);
3162 3037  
3163      -        return (0);
3164      -}
3165      -
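vdev_dtl_reassess() above folds the per-leaf dirty time logs (DTLs) upward through the tree. For a mirror-like parent the rule is that a txg is missing only if every child is missing it; the toy below applies that rule over a 64-txg bitmap window (a simplification: real DTLs are range trees, not bitmaps):

#include <stdint.h>

/* One bit per txg; a set bit means that txg is missing on that child. */
static uint64_t
toy_mirror_dtl(const uint64_t *child_missing, int children)
{
        uint64_t missing = ~0ULL;

        for (int c = 0; c < children; c++)
                missing &= child_missing[c];    /* all children must miss */
        return (missing);
}
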
3166      -static int
3167      -spa_ld_load_dedup_tables(spa_t *spa)
3168      -{
3169      -        int error = 0;
3170      -        vdev_t *rvd = spa->spa_root_vdev;
3171      -
3172      -        error = ddt_load(spa);
3173      -        if (error != 0) {
3174      -                spa_load_failed(spa, "ddt_load failed [error=%d]", error);
3175      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3176      -        }
3177      -
3178      -        return (0);
3179      -}
3180      -
3181      -static int
3182      -spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, char **ereport)
3183      -{
3184      -        vdev_t *rvd = spa->spa_root_vdev;
3185      -
3186      -        if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) {
3187      -                boolean_t missing = spa_check_logs(spa);
3188      -                if (missing) {
3189      -                        if (spa->spa_missing_tvds != 0) {
3190      -                                spa_load_note(spa, "spa_check_logs failed "
3191      -                                    "so dropping the logs");
3192      -                        } else {
3193      -                                *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3194      -                                spa_load_failed(spa, "spa_check_logs failed");
3195      -                                return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG,
3196      -                                    ENXIO));
3197      -                        }
3198      -                }
3199      -        }
3200      -
3201      -        return (0);
3202      -}
3203      -
3204      -static int
3205      -spa_ld_verify_pool_data(spa_t *spa)
3206      -{
3207      -        int error = 0;
3208      -        vdev_t *rvd = spa->spa_root_vdev;
3209      -
3210 3038          /*
3211      -         * We've successfully opened the pool, verify that we're ready
3212      -         * to start pushing transactions.
     3039 +         * Load the DDTs (dedup tables).
3213 3040           */
3214      -        if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3215      -                error = spa_load_verify(spa);
3216      -                if (error != 0) {
3217      -                        spa_load_failed(spa, "spa_load_verify failed "
3218      -                            "[error=%d]", error);
3219      -                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3220      -                            error));
3221      -                }
3222      -        }
3223      -
3224      -        return (0);
3225      -}
3226      -
3227      -static void
3228      -spa_ld_claim_log_blocks(spa_t *spa)
3229      -{
3230      -        dmu_tx_t *tx;
3231      -        dsl_pool_t *dp = spa_get_dsl(spa);
3232      -
3233      -        /*
3234      -         * Claim log blocks that haven't been committed yet.
3235      -         * This must all happen in a single txg.
3236      -         * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3237      -         * invoked from zil_claim_log_block()'s i/o done callback.
3238      -         * Price of rollback is that we abandon the log.
3239      -         */
3240      -        spa->spa_claiming = B_TRUE;
3241      -
3242      -        tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3243      -        (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3244      -            zil_claim, tx, DS_FIND_CHILDREN);
3245      -        dmu_tx_commit(tx);
3246      -
3247      -        spa->spa_claiming = B_FALSE;
3248      -
3249      -        spa_set_log_state(spa, SPA_LOG_GOOD);
3250      -}
3251      -
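Two details of the claim sequence above are worth modeling: all claims run inside a single assigned txg, and spa_claim_max_txg is raised from each claimed block's I/O-done callback so the later txg_wait_synced() knows how far to go. A toy version of that notification (hypothetical struct; the real update lives in spa_claim_notify() and takes a lock, since claims complete concurrently):

#include <stdint.h>

typedef struct toy_claim_state {
        uint64_t        cs_claim_max_txg;       /* highest claimed birth txg */
} toy_claim_state_t;

/* Called once per claimed log block with that block's birth txg. */
static void
toy_claim_notify(toy_claim_state_t *cs, uint64_t birth_txg)
{
        if (birth_txg > cs->cs_claim_max_txg)
                cs->cs_claim_max_txg = birth_txg;
}
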
3252      -static void
3253      -spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg,
3254      -    boolean_t reloading)
3255      -{
3256      -        vdev_t *rvd = spa->spa_root_vdev;
3257      -        int need_update = B_FALSE;
3258      -
3259      -        /*
3260      -         * If the config cache is stale, or we have uninitialized
3261      -         * metaslabs (see spa_vdev_add()), then update the config.
3262      -         *
3263      -         * If this is a verbatim import, trust the current
3264      -         * in-core spa_config and update the disk labels.
3265      -         */
3266      -        if (reloading || config_cache_txg != spa->spa_config_txg ||
3267      -            spa->spa_load_state == SPA_LOAD_IMPORT ||
3268      -            spa->spa_load_state == SPA_LOAD_RECOVER ||
3269      -            (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3270      -                need_update = B_TRUE;
3271      -
3272      -        for (int c = 0; c < rvd->vdev_children; c++)
3273      -                if (rvd->vdev_child[c]->vdev_ms_array == 0)
3274      -                        need_update = B_TRUE;
3275      -
3276      -        /*
3277      -         * Update the config cache asynchronously in case we're the
3278      -         * root pool, in which case the config cache isn't writable yet.
3279      -         */
3280      -        if (need_update)
3281      -                spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3282      -}
3283      -
3284      -static void
3285      -spa_ld_prepare_for_reload(spa_t *spa)
3286      -{
3287      -        int mode = spa->spa_mode;
3288      -        int async_suspended = spa->spa_async_suspended;
3289      -
3290      -        spa_unload(spa);
3291      -        spa_deactivate(spa);
3292      -        spa_activate(spa, mode);
3293      -
3294      -        /*
3295      -         * We save the value of spa_async_suspended as it gets reset to 0 by
3296      -         * spa_unload(). We want to restore it to its original value before
3297      -         * returning as we might be calling spa_async_resume() later.
3298      -         */
3299      -        spa->spa_async_suspended = async_suspended;
3300      -}
3301      -
3302      -/*
3303      - * Load an existing storage pool, using the config provided. This config
3304      - * describes which vdevs are part of the pool and is later validated against
3305      - * partial configs present in each vdev's label and an entire copy of the
3306      - * config stored in the MOS.
3307      - */
3308      -static int
3309      -spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
3310      -    boolean_t reloading)
3311      -{
3312      -        int error = 0;
3313      -        boolean_t missing_feat_write = B_FALSE;
3314      -
3315      -        ASSERT(MUTEX_HELD(&spa_namespace_lock));
3316      -        ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
3317      -
3318      -        /*
3319      -         * Never trust the config that is provided unless we are assembling
3320      -         * a pool following a split.
3321      -         * This means don't trust blkptrs and the vdev tree in general. This
3322      -         * also effectively puts the spa in read-only mode since
3323      -         * spa_writeable() checks for spa_trust_config to be true.
3324      -         * We will later load a trusted config from the MOS.
3325      -         */
3326      -        if (type != SPA_IMPORT_ASSEMBLE)
3327      -                spa->spa_trust_config = B_FALSE;
3328      -
3329      -        if (reloading)
3330      -                spa_load_note(spa, "RELOADING");
3331      -        else
3332      -                spa_load_note(spa, "LOADING");
3333      -
3334      -        /*
3335      -         * Parse the config provided to create a vdev tree.
3336      -         */
3337      -        error = spa_ld_parse_config(spa, type);
     3041 +        error = ddt_load(spa);
3338 3042          if (error != 0)
3339      -                return (error);
     3043 +                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3340 3044  
3341      -        /*
3342      -         * Now that we have the vdev tree, try to open each vdev. This involves
3343      -         * opening the underlying physical device, retrieving its geometry and
3344      -         * probing the vdev with a dummy I/O. The state of each vdev will be set
3345      -         * based on the success of those operations. After this we'll be ready
3346      -         * to read from the vdevs.
3347      -         */
3348      -        error = spa_ld_open_vdevs(spa);
3349      -        if (error != 0)
3350      -                return (error);
     3045 +        spa_update_dspace(spa);
3351 3046  
3352 3047          /*
3353      -         * Read the label of each vdev and make sure that the GUIDs stored
3354      -         * there match the GUIDs in the config provided.
3355      -         * If we're assembling a new pool that's been split off from an
3356      -         * existing pool, the labels haven't yet been updated so we skip
3357      -         * validation for now.
     3048 +         * Validate the config, using the MOS config to fill in any
     3049 +         * information which might be missing.  If we fail to validate
     3050 +         * the config then declare the pool unfit for use. If we're
     3051 +         * assembling a pool from a split, the log is not transferred
     3052 +         * over.
3358 3053           */
3359 3054          if (type != SPA_IMPORT_ASSEMBLE) {
3360      -                error = spa_ld_validate_vdevs(spa);
3361      -                if (error != 0)
3362      -                        return (error);
3363      -        }
     3055 +                nvlist_t *nvconfig;
3364 3056  
3365      -        /*
3366      -         * Read vdev labels to find the best uberblock (i.e. latest, unless
3367      -         * spa_load_max_txg is set) and store it in spa_uberblock. We get the
3368      -         * list of features required to read blkptrs in the MOS from the vdev
3369      -         * label with the best uberblock and verify that our version of zfs
3370      -         * supports them all.
3371      -         */
3372      -        error = spa_ld_select_uberblock(spa, type);
3373      -        if (error != 0)
3374      -                return (error);
     3057 +                if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
     3058 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3375 3059  
3376      -        /*
3377      -         * Pass that uberblock to the dsl_pool layer which will open the root
3378      -         * blkptr. This blkptr points to the latest version of the MOS and will
3379      -         * allow us to read its contents.
3380      -         */
3381      -        error = spa_ld_open_rootbp(spa);
3382      -        if (error != 0)
3383      -                return (error);
     3060 +                if (!spa_config_valid(spa, nvconfig)) {
     3061 +                        nvlist_free(nvconfig);
     3062 +                        return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
     3063 +                            ENXIO));
     3064 +                }
     3065 +                nvlist_free(nvconfig);
3384 3066  
3385      -        /*
3386      -         * Retrieve the trusted config stored in the MOS and use it to create
3387      -         * a new, exact version of the vdev tree, then reopen all vdevs.
3388      -         */
3389      -        error = spa_ld_load_trusted_config(spa, type, reloading);
3390      -        if (error == EAGAIN) {
3391      -                VERIFY(!reloading);
3392 3067                  /*
3393      -                 * Redo the loading process with the trusted config if it is
3394      -                 * too different from the untrusted config.
     3068 +                 * Now that we've validated the config, check the state of the
     3069 +                 * root vdev.  If it can't be opened, it indicates one or
     3070 +                 * more toplevel vdevs are faulted.
3395 3071                   */
3396      -                spa_ld_prepare_for_reload(spa);
3397      -                return (spa_load_impl(spa, type, ereport, B_TRUE));
3398      -        } else if (error != 0) {
3399      -                return (error);
     3072 +                if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
     3073 +                        return (SET_ERROR(ENXIO));
     3074 +
     3075 +                if (spa_writeable(spa) && spa_check_logs(spa)) {
     3076 +                        *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
     3077 +                        return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ENXIO));
     3078 +                }
3400 3079          }
3401 3080  
3402      -        /*
3403      -         * Retrieve the mapping of indirect vdevs. Those vdevs were removed
3404      -         * from the pool and their contents were re-mapped to other vdevs. Note
3405      -         * that everything that we read before this step must have been
3406      -         * rewritten on concrete vdevs after the last device removal was
3407      -         * initiated. Otherwise we could be reading from indirect vdevs before
3408      -         * we have loaded their mappings.
3409      -         */
3410      -        error = spa_ld_open_indirect_vdev_metadata(spa);
3411      -        if (error != 0)
3412      -                return (error);
3413      -
3414      -        /*
3415      -         * Retrieve the full list of active features from the MOS and check if
3416      -         * they are all supported.
3417      -         */
3418      -        error = spa_ld_check_features(spa, &missing_feat_write);
3419      -        if (error != 0)
3420      -                return (error);
3421      -
3422      -        /*
3423      -         * Load several special directories from the MOS needed by the dsl_pool
3424      -         * layer.
3425      -         */
3426      -        error = spa_ld_load_special_directories(spa);
3427      -        if (error != 0)
3428      -                return (error);
3429      -
3430      -        /*
3431      -         * Retrieve pool properties from the MOS.
3432      -         */
3433      -        error = spa_ld_get_props(spa);
3434      -        if (error != 0)
3435      -                return (error);
3436      -
3437      -        /*
3438      -         * Retrieve the list of auxiliary devices - cache devices and spares -
3439      -         * and open them.
3440      -         */
3441      -        error = spa_ld_open_aux_vdevs(spa, type);
3442      -        if (error != 0)
3443      -                return (error);
3444      -
3445      -        /*
3446      -         * Load the metadata for all vdevs. Also check if unopenable devices
3447      -         * should be autoreplaced.
3448      -         */
3449      -        error = spa_ld_load_vdev_metadata(spa);
3450      -        if (error != 0)
3451      -                return (error);
3452      -
3453      -        error = spa_ld_load_dedup_tables(spa);
3454      -        if (error != 0)
3455      -                return (error);
3456      -
3457      -        /*
3458      -         * Verify the logs now to make sure we don't have any unexpected errors
3459      -         * when we claim log blocks later.
3460      -         */
3461      -        error = spa_ld_verify_logs(spa, type, ereport);
3462      -        if (error != 0)
3463      -                return (error);
3464      -
3465 3081          if (missing_feat_write) {
3466      -                ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT);
     3082 +                ASSERT(state == SPA_LOAD_TRYIMPORT);
3467 3083  
3468 3084                  /*
3469 3085                   * At this point, we know that we can open the pool in
3470 3086                   * read-only mode but not read-write mode. We now have enough
3471 3087                   * information and can return to userland.
3472 3088                   */
3473      -                return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT,
3474      -                    ENOTSUP));
     3089 +                return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP));
3475 3090          }
3476 3091  
3477 3092          /*
3478      -         * Traverse the last txgs to make sure the pool was left off in a safe
3479      -         * state. When performing an extreme rewind, we verify the whole pool,
3480      -         * which can take a very long time.
     3093 +         * We've successfully opened the pool, verify that we're ready
     3094 +         * to start pushing transactions.
3481 3095           */
3482      -        error = spa_ld_verify_pool_data(spa);
3483      -        if (error != 0)
3484      -                return (error);
     3096 +        if (state != SPA_LOAD_TRYIMPORT) {
     3097 +                if (error = spa_load_verify(spa)) {
     3098 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
     3099 +                            error));
     3100 +                }
     3101 +        }
3485 3102  
3486      -        /*
3487      -         * Calculate the deflated space for the pool. This must be done before
3488      -         * we write anything to the pool because we'd need to update the space
3489      -         * accounting using the deflated sizes.
3490      -         */
3491      -        spa_update_dspace(spa);
3492      -
3493      -        /*
3494      -         * We have now retrieved all the information we needed to open the
3495      -         * pool. If we are importing the pool in read-write mode, a few
3496      -         * additional steps must be performed to finish the import.
3497      -         */
3498      -        if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER ||
     3103 +        if (spa_writeable(spa) && (state == SPA_LOAD_RECOVER ||
3499 3104              spa->spa_load_max_txg == UINT64_MAX)) {
3500      -                uint64_t config_cache_txg = spa->spa_config_txg;
     3105 +                dmu_tx_t *tx;
     3106 +                int need_update = B_FALSE;
     3107 +                dsl_pool_t *dp = spa_get_dsl(spa);
3501 3108  
3502      -                ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT);
     3109 +                ASSERT(state != SPA_LOAD_TRYIMPORT);
3503 3110  
3504 3111                  /*
3505      -                 * Traverse the ZIL and claim all blocks.
     3112 +                 * Claim log blocks that haven't been committed yet.
     3113 +                 * This must all happen in a single txg.
     3114 +                 * Note: spa_claim_max_txg is updated by spa_claim_notify(),
     3115 +                 * invoked from zil_claim_log_block()'s i/o done callback.
     3116 +                 * Price of rollback is that we abandon the log.
3506 3117                   */
3507      -                spa_ld_claim_log_blocks(spa);
     3118 +                spa->spa_claiming = B_TRUE;
3508 3119  
3509      -                /*
3510      -                 * Kick-off the syncing thread.
3511      -                 */
     3120 +                tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
     3121 +                (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
     3122 +                    zil_claim, tx, DS_FIND_CHILDREN);
     3123 +                dmu_tx_commit(tx);
     3124 +
     3125 +                spa->spa_claiming = B_FALSE;
     3126 +
     3127 +                spa_set_log_state(spa, SPA_LOG_GOOD);
3512 3128                  spa->spa_sync_on = B_TRUE;
3513 3129                  txg_sync_start(spa->spa_dsl_pool);
3514 3130  
3515 3131                  /*
3516 3132                   * Wait for all claims to sync.  We sync up to the highest
3517 3133                   * claimed log block birth time so that claimed log blocks
3518 3134                   * don't appear to be from the future.  spa_claim_max_txg
3519      -                 * will have been set for us by ZIL traversal operations
3520      -                 * performed above.
     3135 +                 * will have been set for us by either zil_check_log_chain()
     3136 +                 * (invoked from spa_check_logs()) or zil_claim() above.
3521 3137                   */
3522 3138                  txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
3523 3139  
3524 3140                  /*
3525      -                 * Check if we need to request an update of the config. On the
3526      -                 * next sync, we would update the config stored in vdev labels
3527      -                 * and the cachefile (by default /etc/zfs/zpool.cache).
     3141 +                 * If the config cache is stale, or we have uninitialized
     3142 +                 * metaslabs (see spa_vdev_add()), then update the config.
     3143 +                 *
     3144 +                 * If this is a verbatim import, trust the current
     3145 +                 * in-core spa_config and update the disk labels.
3528 3146                   */
3529      -                spa_ld_check_for_config_update(spa, config_cache_txg,
3530      -                    reloading);
     3147 +                if (config_cache_txg != spa->spa_config_txg ||
     3148 +                    state == SPA_LOAD_IMPORT ||
     3149 +                    state == SPA_LOAD_RECOVER ||
     3150 +                    (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
     3151 +                        need_update = B_TRUE;
3531 3152  
     3153 +                for (int c = 0; c < rvd->vdev_children; c++)
     3154 +                        if (rvd->vdev_child[c]->vdev_ms_array == 0)
     3155 +                                need_update = B_TRUE;
     3156 +
3532 3157                  /*
     3158 +                 * Update the config cache asynchronously in case we're the
     3159 +                 * root pool, in which case the config cache isn't writable yet.
     3160 +                 */
     3161 +                if (need_update)
     3162 +                        spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
     3163 +
     3164 +                /*
3533 3165                   * Check all DTLs to see if anything needs resilvering.
3534 3166                   */
3535 3167                  if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
3536      -                    vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
     3168 +                    vdev_resilver_needed(rvd, NULL, NULL))
3537 3169                          spa_async_request(spa, SPA_ASYNC_RESILVER);
3538 3170  
3539 3171                  /*
3540 3172                   * Log the fact that we booted up (so that we can detect if
3541 3173                   * we rebooted in the middle of an operation).
3542 3174                   */
3543 3175                  spa_history_log_version(spa, "open");
3544 3176  
3545      -                /*
3546      -                 * Delete any inconsistent datasets.
3547      -                 */
3548      -                (void) dmu_objset_find(spa_name(spa),
3549      -                    dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN);
     3177 +                dsl_destroy_inconsistent(spa_get_dsl(spa));
3550 3178  
3551 3179                  /*
3552 3180                   * Clean up any stale temporary dataset userrefs.
3553 3181                   */
3554 3182                  dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
3555      -
3556      -                spa_restart_removal(spa);
3557      -
3558      -                spa_spawn_aux_threads(spa);
3559 3183          }
3560 3184  
3561      -        spa_load_note(spa, "LOADED");
     3185 +        spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD);
3562 3186  
3563 3187          return (0);
3564 3188  }
3565 3189  
3566 3190  static int
3567      -spa_load_retry(spa_t *spa, spa_load_state_t state)
     3191 +spa_load_retry(spa_t *spa, spa_load_state_t state, int mosconfig)
3568 3192  {
3569 3193          int mode = spa->spa_mode;
3570 3194  
3571 3195          spa_unload(spa);
3572 3196          spa_deactivate(spa);
3573 3197  
3574 3198          spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
3575 3199  
3576 3200          spa_activate(spa, mode);
3577 3201          spa_async_suspend(spa);
3578 3202  
3579      -        spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
3580      -            (u_longlong_t)spa->spa_load_max_txg);
3581      -
3582      -        return (spa_load(spa, state, SPA_IMPORT_EXISTING));
     3203 +        return (spa_load(spa, state, SPA_IMPORT_EXISTING, mosconfig));
3583 3204  }
3584 3205  
3585 3206  /*
3586 3207   * If spa_load() fails this function will try loading prior txg's. If
3587 3208   * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
3588 3209   * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
3589 3210   * function will not rewind the pool and will return the same error as
3590 3211   * spa_load().
3591 3212   */
3592 3213  static int
3593      -spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request,
3594      -    int rewind_flags)
     3214 +spa_load_best(spa_t *spa, spa_load_state_t state, int mosconfig,
     3215 +    uint64_t max_request, int rewind_flags)
3595 3216  {
3596 3217          nvlist_t *loadinfo = NULL;
3597 3218          nvlist_t *config = NULL;
3598 3219          int load_error, rewind_error;
3599 3220          uint64_t safe_rewind_txg;
3600 3221          uint64_t min_txg;
3601 3222  
3602 3223          if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
3603 3224                  spa->spa_load_max_txg = spa->spa_load_txg;
3604 3225                  spa_set_log_state(spa, SPA_LOG_CLEAR);
3605 3226          } else {
3606 3227                  spa->spa_load_max_txg = max_request;
3607 3228                  if (max_request != UINT64_MAX)
3608 3229                          spa->spa_extreme_rewind = B_TRUE;
3609 3230          }
3610 3231  
3611      -        load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING);
     3232 +        load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING,
     3233 +            mosconfig);
3612 3234          if (load_error == 0)
3613 3235                  return (0);
3614 3236  
3615 3237          if (spa->spa_root_vdev != NULL)
3616 3238                  config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3617 3239  
3618 3240          spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
3619 3241          spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
3620 3242  
3621 3243          if (rewind_flags & ZPOOL_NEVER_REWIND) {
(20 lines elided)
3642 3264              TXG_INITIAL : safe_rewind_txg;
3643 3265  
3644 3266          /*
3645 3267           * Continue as long as we're finding errors, we're still within
3646 3268           * the acceptable rewind range, and we're still finding uberblocks
3647 3269           */
3648 3270          while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
3649 3271              spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
3650 3272                  if (spa->spa_load_max_txg < safe_rewind_txg)
3651 3273                          spa->spa_extreme_rewind = B_TRUE;
3652      -                rewind_error = spa_load_retry(spa, state);
     3274 +                rewind_error = spa_load_retry(spa, state, mosconfig);
3653 3275          }
3654 3276  
3655 3277          spa->spa_extreme_rewind = B_FALSE;
3656 3278          spa->spa_load_max_txg = UINT64_MAX;
3657 3279  
3658 3280          if (config && (rewind_error || state != SPA_LOAD_RECOVER))
3659 3281                  spa_config_set(spa, config);
3660 3282          else
3661 3283                  nvlist_free(config);
3662 3284  
(26 lines elided)
3689 3311   * ambiguous state.
3690 3312   */
3691 3313  static int
3692 3314  spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
3693 3315      nvlist_t **config)
3694 3316  {
3695 3317          spa_t *spa;
3696 3318          spa_load_state_t state = SPA_LOAD_OPEN;
3697 3319          int error;
3698 3320          int locked = B_FALSE;
     3321 +        boolean_t open_with_activation = B_FALSE;
3699 3322  
3700 3323          *spapp = NULL;
3701 3324  
3702 3325          /*
3703 3326           * As disgusting as this is, we need to support recursive calls to this
3704 3327           * function because dsl_dir_open() is called during spa_load(), and ends
3705 3328           * up calling spa_open() again.  The real fix is to figure out how to
3706 3329           * avoid dsl_dir_open() calling this in the first place.
3707 3330           */
3708 3331          if (mutex_owner(&spa_namespace_lock) != curthread) {
(12 lines elided)
3721 3344  
3722 3345                  zpool_get_rewind_policy(nvpolicy ? nvpolicy : spa->spa_config,
3723 3346                      &policy);
3724 3347                  if (policy.zrp_request & ZPOOL_DO_REWIND)
3725 3348                          state = SPA_LOAD_RECOVER;
3726 3349  
3727 3350                  spa_activate(spa, spa_mode_global);
3728 3351  
3729 3352                  if (state != SPA_LOAD_RECOVER)
3730 3353                          spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
3731      -                spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
3732 3354  
3733      -                zfs_dbgmsg("spa_open_common: opening %s", pool);
3734      -                error = spa_load_best(spa, state, policy.zrp_txg,
     3355 +                error = spa_load_best(spa, state, B_FALSE, policy.zrp_txg,
3735 3356                      policy.zrp_request);
3736 3357  
3737 3358                  if (error == EBADF) {
3738 3359                          /*
3739 3360                           * If vdev_validate() returns failure (indicated by
3740 3361                           * EBADF), it indicates that one of the vdevs indicates
3741 3362                           * that the pool has been exported or destroyed.  If
3742 3363                           * this is the case, the config cache is out of sync and
3743 3364                           * we should remove the pool from the namespace.
3744 3365                           */
3745 3366                          spa_unload(spa);
3746 3367                          spa_deactivate(spa);
3747      -                        spa_write_cachefile(spa, B_TRUE, B_TRUE);
     3368 +                        spa_config_sync(spa, B_TRUE, B_TRUE);
3748 3369                          spa_remove(spa);
3749 3370                          if (locked)
3750 3371                                  mutex_exit(&spa_namespace_lock);
3751 3372                          return (SET_ERROR(ENOENT));
3752 3373                  }
3753 3374  
3754 3375                  if (error) {
3755 3376                          /*
3756 3377                           * We can't open the pool, but we still have useful
3757 3378                           * information: the state of each vdev after the
(7 lines elided)
3765 3386                                      spa->spa_load_info) == 0);
3766 3387                          }
3767 3388                          spa_unload(spa);
3768 3389                          spa_deactivate(spa);
3769 3390                          spa->spa_last_open_failed = error;
3770 3391                          if (locked)
3771 3392                                  mutex_exit(&spa_namespace_lock);
3772 3393                          *spapp = NULL;
3773 3394                          return (error);
3774 3395                  }
     3396 +
     3397 +                open_with_activation = B_TRUE;
3775 3398          }
3776 3399  
3777 3400          spa_open_ref(spa, tag);
3778 3401  
3779 3402          if (config != NULL)
3780 3403                  *config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3781 3404  
3782 3405          /*
3783 3406           * If we've recovered the pool, pass back any information we
3784 3407           * gathered while doing the load.
(3 lines elided)
3788 3411                      spa->spa_load_info) == 0);
3789 3412          }
3790 3413  
3791 3414          if (locked) {
3792 3415                  spa->spa_last_open_failed = 0;
3793 3416                  spa->spa_last_ubsync_txg = 0;
3794 3417                  spa->spa_load_txg = 0;
3795 3418                  mutex_exit(&spa_namespace_lock);
3796 3419          }
3797 3420  
     3421 +        if (open_with_activation)
     3422 +                wbc_activate(spa, B_FALSE);
     3423 +
3798 3424          *spapp = spa;
3799 3425  
3800 3426          return (0);
3801 3427  }
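For reference, a minimal consumer of this path (a sketch; "tank" is a
hypothetical pool name, FTAG is the usual caller tag, and nvpolicy is the
rewind-policy nvlist sketched earlier, or NULL for no rewind):

        spa_t *spa;
        nvlist_t *config = NULL;
        int err;

        err = spa_open_rewind("tank", &spa, FTAG, nvpolicy, &config);
        if (err == 0) {
                /* the open holds a reference; drop it when done */
                if (config != NULL)
                        nvlist_free(config);
                spa_close(spa, FTAG);
        }

Note that with the change above, the first open that activates the pool also
calls wbc_activate(spa, B_FALSE) once the namespace lock has been dropped.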
3802 3428  
3803 3429  int
3804 3430  spa_open_rewind(const char *name, spa_t **spapp, void *tag, nvlist_t *policy,
3805 3431      nvlist_t **config)
3806 3432  {
3807 3433          return (spa_open_common(name, spapp, tag, policy, config));
(429 lines elided)
4237 3863          spa_t *spa;
4238 3864          char *altroot = NULL;
4239 3865          vdev_t *rvd;
4240 3866          dsl_pool_t *dp;
4241 3867          dmu_tx_t *tx;
4242 3868          int error = 0;
4243 3869          uint64_t txg = TXG_INITIAL;
4244 3870          nvlist_t **spares, **l2cache;
4245 3871          uint_t nspares, nl2cache;
4246 3872          uint64_t version, obj;
4247      -        boolean_t has_features;
     3873 +        boolean_t has_features = B_FALSE, wbc_feature_exists = B_FALSE;
     3874 +        spa_meta_placement_t *mp;
4248 3875  
4249 3876          /*
4250 3877           * If this pool already exists, return failure.
4251 3878           */
4252 3879          mutex_enter(&spa_namespace_lock);
4253 3880          if (spa_lookup(pool) != NULL) {
4254 3881                  mutex_exit(&spa_namespace_lock);
4255 3882                  return (SET_ERROR(EEXIST));
4256 3883          }
4257 3884  
4258 3885          /*
4259 3886           * Allocate a new spa_t structure.
4260 3887           */
4261 3888          (void) nvlist_lookup_string(props,
4262 3889              zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4263 3890          spa = spa_add(pool, NULL, altroot);
4264 3891          spa_activate(spa, spa_mode_global);
4265 3892  
4266      -        if (props && (error = spa_prop_validate(spa, props))) {
4267      -                spa_deactivate(spa);
4268      -                spa_remove(spa);
4269      -                mutex_exit(&spa_namespace_lock);
4270      -                return (error);
4271      -        }
     3893 +        if (props != NULL) {
     3894 +                nvpair_t *wbc_feature_nvp = NULL;
4272 3895  
4273      -        has_features = B_FALSE;
4274      -        for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
4275      -            elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
4276      -                if (zpool_prop_feature(nvpair_name(elem)))
4277      -                        has_features = B_TRUE;
     3896 +                for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
     3897 +                    elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
     3898 +                        const char *propname = nvpair_name(elem);
     3899 +                        if (zpool_prop_feature(propname)) {
     3900 +                                spa_feature_t feature;
     3901 +                                int err;
     3902 +                                const char *fname = strchr(propname, '@') + 1;
     3903 +
     3904 +                                err = zfeature_lookup_name(fname, &feature);
     3905 +                                if (err == 0 && feature == SPA_FEATURE_WBC) {
     3906 +                                        wbc_feature_nvp = elem;
     3907 +                                        wbc_feature_exists = B_TRUE;
     3908 +                                }
     3909 +
     3910 +                                has_features = B_TRUE;
     3911 +                        }
     3912 +                }
     3913 +
     3914 +                /*
     3915 +                 * We do not want to enable feature@wbc if this
     3916 +                 * pool does not have a special vdev.  At this
     3917 +                 * stage we remove the feature from the common
     3918 +                 * list; later, once we verify that a special
     3919 +                 * vdev is available, the feature will be enabled.
     3920 +                 */
     3921 +                if (wbc_feature_nvp != NULL)
     3922 +                        fnvlist_remove_nvpair(props, wbc_feature_nvp);
     3923 +
     3924 +                if ((error = spa_prop_validate(spa, props)) != 0) {
     3925 +                        spa_deactivate(spa);
     3926 +                        spa_remove(spa);
     3927 +                        mutex_exit(&spa_namespace_lock);
     3928 +                        return (error);
     3929 +                }
4278 3930          }
4279 3931  
     3932 +
4280 3933          if (has_features || nvlist_lookup_uint64(props,
4281 3934              zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
4282 3935                  version = SPA_VERSION;
4283 3936          }
4284 3937          ASSERT(SPA_VERSION_IS_SUPPORTED(version));
4285 3938  
4286 3939          spa->spa_first_txg = txg;
4287 3940          spa->spa_uberblock.ub_txg = txg - 1;
4288 3941          spa->spa_uberblock.ub_version = version;
4289 3942          spa->spa_ubsync = spa->spa_uberblock;
4290 3943          spa->spa_load_state = SPA_LOAD_CREATE;
4291      -        spa->spa_removing_phys.sr_state = DSS_NONE;
4292      -        spa->spa_removing_phys.sr_removing_vdev = -1;
4293      -        spa->spa_removing_phys.sr_prev_indirect_vdev = -1;
4294 3944  
4295 3945          /*
4296 3946           * Create "The Godfather" zio to hold all async IOs
4297 3947           */
4298 3948          spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
4299 3949              KM_SLEEP);
4300 3950          for (int i = 0; i < max_ncpus; i++) {
4301 3951                  spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
4302 3952                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
4303 3953                      ZIO_FLAG_GODFATHER);
(123 lines elided)
4427 4077          }
4428 4078          VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
4429 4079              spa->spa_meta_objset, obj));
4430 4080  
4431 4081          /*
4432 4082           * Create the pool's history object.
4433 4083           */
4434 4084          if (version >= SPA_VERSION_ZPOOL_HISTORY)
4435 4085                  spa_history_create_obj(spa, tx);
4436 4086  
     4087 +        mp = &spa->spa_meta_policy;
     4088 +
4437 4089          /*
4438 4090           * Generate some random noise for salted checksums to operate on.
4439 4091           */
4440 4092          (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
4441 4093              sizeof (spa->spa_cksum_salt.zcs_bytes));
4442 4094  
4443 4095          /*
4444 4096           * Set pool properties.
4445 4097           */
4446 4098          spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
4447 4099          spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
4448 4100          spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
4449 4101          spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
     4102 +        spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
     4103 +        spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
     4104 +        spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
     4105 +        spa->spa_ddt_meta_copies =
     4106 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUPMETA_DITTO);
     4107 +        spa->spa_dedup_best_effort =
     4108 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_BEST_EFFORT);
     4109 +        spa->spa_dedup_lo_best_effort =
     4110 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
     4111 +        spa->spa_dedup_hi_best_effort =
     4112 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
     4113 +        spa->spa_force_trim = zpool_prop_default_numeric(ZPOOL_PROP_FORCETRIM);
4450 4114  
     4115 +        spa->spa_resilver_prio =
     4116 +            zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
     4117 +        spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
     4118 +
     4119 +        mutex_enter(&spa->spa_auto_trim_lock);
     4120 +        spa->spa_auto_trim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM);
     4121 +        if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     4122 +                spa_auto_trim_taskq_create(spa);
     4123 +        mutex_exit(&spa->spa_auto_trim_lock);
     4124 +
     4125 +        mp->spa_enable_meta_placement_selection =
     4126 +            zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
     4127 +        mp->spa_sync_to_special =
     4128 +            zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
     4129 +        mp->spa_ddt_meta_to_special =
     4130 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
     4131 +        mp->spa_zfs_meta_to_special =
     4132 +            zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
     4133 +        mp->spa_small_data_to_special =
     4134 +            zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
     4135 +
     4136 +        spa_set_ddt_classes(spa, 0);
     4137 +
4451 4138          if (props != NULL) {
4452 4139                  spa_configfile_set(spa, props, B_FALSE);
4453 4140                  spa_sync_props(props, tx);
4454 4141          }
4455 4142  
     4143 +        if (spa_has_special(spa)) {
     4144 +                spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
     4145 +                spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4146 +
     4147 +                if (wbc_feature_exists)
     4148 +                        spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
     4149 +        }
     4150 +
4456 4151          dmu_tx_commit(tx);
4457 4152  
4458 4153          spa->spa_sync_on = B_TRUE;
4459 4154          txg_sync_start(spa->spa_dsl_pool);
4460 4155  
4461 4156          /*
4462 4157           * We explicitly wait for the first transaction to complete so that our
4463 4158           * bean counters are appropriately updated.
4464 4159           */
4465 4160          txg_wait_synced(spa->spa_dsl_pool, txg);
4466 4161  
4467      -        spa_spawn_aux_threads(spa);
4468      -
4469      -        spa_write_cachefile(spa, B_FALSE, B_TRUE);
     4162 +        spa_config_sync(spa, B_FALSE, B_TRUE);
4470 4163          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
4471 4164  
4472 4165          spa_history_log_version(spa, "create");
4473 4166  
4474 4167          /*
4475 4168           * Don't count references from objsets that are already closed
4476 4169           * and are making their way through the eviction process.
4477 4170           */
4478 4171          spa_evicting_os_wait(spa);
4479 4172          spa->spa_minref = refcount_count(&spa->spa_refcount);
4480 4173          spa->spa_load_state = SPA_LOAD_NONE;
4481 4174  
4482 4175          mutex_exit(&spa_namespace_lock);
4483 4176  
     4177 +        wbc_activate(spa, B_TRUE);
     4178 +
4484 4179          return (0);
4485 4180  }
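The feature-stripping loop above keys off the "feature@<name>" property
syntax.  A sketch of the props nvlist a caller might hand to spa_create()
(the short name "wbc" is an assumption inferred from SPA_FEATURE_WBC; the
uint64 value 0 means "enabled", as in spa_check_special_feature() below):

        nvlist_t *props;

        VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP) == 0);
        /* hypothetical feature name; stripped and deferred by spa_create() */
        VERIFY(nvlist_add_uint64(props, "feature@wbc", 0) == 0);

If the new pool ends up with a special vdev, spa_create() enables
SPA_FEATURE_META_DEVICES and the deferred WBC feature in the same
transaction.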
4486 4181  
     4182 +
     4183 +/*
     4184 + * See if the pool has a special tier and, if so, enable/activate
     4185 + * the feature as needed.  Activation is not reference counted.
     4186 + */
     4187 +static void
     4188 +spa_check_special_feature(spa_t *spa)
     4189 +{
     4190 +        if (spa_has_special(spa)) {
     4191 +                nvlist_t *props = NULL;
     4192 +
     4193 +                if (!spa_feature_is_enabled(spa, SPA_FEATURE_META_DEVICES)) {
     4194 +                        VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, 0) == 0);
     4195 +                        VERIFY(nvlist_add_uint64(props,
     4196 +                            FEATURE_META_DEVICES, 0) == 0);
     4197 +                        VERIFY(spa_prop_set(spa, props) == 0);
     4198 +                        nvlist_free(props);
     4199 +                }
     4200 +
     4201 +                if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
     4202 +                        dmu_tx_t *tx =
     4203 +                            dmu_tx_create_dd(spa->spa_dsl_pool->dp_mos_dir);
     4204 +
     4205 +                        VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
     4206 +                        spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4207 +                        dmu_tx_commit(tx);
     4208 +                }
     4209 +        }
     4210 +}
     4211 +
     4212 +static void
     4213 +spa_special_feature_activate(void *arg, dmu_tx_t *tx)
     4214 +{
     4215 +        spa_t *spa = (spa_t *)arg;
     4216 +
     4217 +        if (spa_has_special(spa)) {
     4218 +                /* enable and activate as needed */
     4219 +                spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
     4220 +                if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
     4221 +                        spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4222 +                }
     4223 +
     4224 +                spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
     4225 +        }
     4226 +}
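spa_special_feature_activate() runs in syncing context, while
spa_check_special_feature() arranges its own transaction.  spa_import()
below calls the latter directly for non-writable imports and dispatches the
former as a sync task for writable ones:

        error = dsl_sync_task(spa->spa_name, NULL /* no check func */,
            spa_special_feature_activate, spa, 3, ZFS_SPACE_CHECK_RESERVED);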
     4227 +
4487 4228  #ifdef _KERNEL
4488 4229  /*
4489 4230   * Get the root pool information from the root disk, then import the root pool
4490 4231   * during the system boot up time.
4491 4232   */
4492 4233  extern int vdev_disk_read_rootlabel(char *, char *, nvlist_t **);
4493 4234  
4494 4235  static nvlist_t *
4495 4236  spa_generate_rootconf(char *devpath, char *devid, uint64_t *guid)
4496 4237  {
(105 lines elided)
4602 4343                  cmn_err(CE_NOTE, "Cannot read the pool label from '%s'",
4603 4344                      devpath);
4604 4345                  return (SET_ERROR(EIO));
4605 4346          }
4606 4347  
4607 4348          VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
4608 4349              &pname) == 0);
4609 4350          VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
4610 4351  
4611 4352          mutex_enter(&spa_namespace_lock);
4612      -        if ((spa = spa_lookup(pname)) != NULL) {
     4353 +        if ((spa = spa_lookup(pname)) != NULL || spa_config_guid_exists(guid)) {
4613 4354                  /*
4614 4355                   * Remove the existing root pool from the namespace so that we
4615 4356                   * can replace it with the correct config we just read in.
4616 4357                   */
4617 4358                  if (spa != NULL) spa_remove(spa);
4618 4359          }
4619 4360  
4620 4361          spa = spa_add(pname, config, NULL);
4621 4362          spa->spa_is_root = B_TRUE;
4622 4363          spa->spa_import_flags = ZFS_IMPORT_VERBATIM;
4623      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
4624      -            &spa->spa_ubsync.ub_version) != 0)
4625      -                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
4626 4364  
4627 4365          /*
4628 4366           * Build up a vdev tree based on the boot device's label config.
4629 4367           */
4630 4368          VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4631 4369              &nvtop) == 0);
4632 4370          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4633 4371          error = spa_config_parse(spa, &rvd, nvtop, NULL, 0,
4634 4372              VDEV_ALLOC_ROOTPOOL);
4635 4373          spa_config_exit(spa, SCL_ALL, FTAG);
↓ open down ↓ 63 lines elided ↑ open up ↑
4699 4437          spa_t *spa;
4700 4438          char *altroot = NULL;
4701 4439          spa_load_state_t state = SPA_LOAD_IMPORT;
4702 4440          zpool_rewind_policy_t policy;
4703 4441          uint64_t mode = spa_mode_global;
4704 4442          uint64_t readonly = B_FALSE;
4705 4443          int error;
4706 4444          nvlist_t *nvroot;
4707 4445          nvlist_t **spares, **l2cache;
4708 4446          uint_t nspares, nl2cache;
     4447 +        uint64_t guid;
4709 4448  
     4449 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid) != 0)
     4450 +                return (SET_ERROR(EINVAL));
     4451 +
4710 4452          /*
4711 4453           * If a pool with this name exists, return failure.
4712 4454           */
4713 4455          mutex_enter(&spa_namespace_lock);
4714      -        if (spa_lookup(pool) != NULL) {
     4456 +        if (spa_lookup(pool) != NULL || spa_config_guid_exists(guid)) {
4715 4457                  mutex_exit(&spa_namespace_lock);
4716 4458                  return (SET_ERROR(EEXIST));
4717 4459          }
4718 4460  
4719 4461          /*
4720 4462           * Create and initialize the spa structure.
4721 4463           */
4722 4464          (void) nvlist_lookup_string(props,
4723 4465              zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4724 4466          (void) nvlist_lookup_uint64(props,
(4 lines elided)
4729 4471          spa->spa_import_flags = flags;
4730 4472  
4731 4473          /*
4732 4474           * Verbatim import - Take a pool and insert it into the namespace
4733 4475           * as if it had been loaded at boot.
4734 4476           */
4735 4477          if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
4736 4478                  if (props != NULL)
4737 4479                          spa_configfile_set(spa, props, B_FALSE);
4738 4480  
4739      -                spa_write_cachefile(spa, B_FALSE, B_TRUE);
     4481 +                spa_config_sync(spa, B_FALSE, B_TRUE);
4740 4482                  spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4741      -                zfs_dbgmsg("spa_import: verbatim import of %s", pool);
     4483 +
4742 4484                  mutex_exit(&spa_namespace_lock);
4743 4485                  return (0);
4744 4486          }
4745 4487  
4746 4488          spa_activate(spa, mode);
4747 4489  
4748 4490          /*
4749 4491           * Don't start async tasks until we know everything is healthy.
4750 4492           */
4751 4493          spa_async_suspend(spa);
4752 4494  
4753 4495          zpool_get_rewind_policy(config, &policy);
4754 4496          if (policy.zrp_request & ZPOOL_DO_REWIND)
4755 4497                  state = SPA_LOAD_RECOVER;
4756 4498  
4757      -        spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT;
4758      -
4759      -        if (state != SPA_LOAD_RECOVER) {
     4499 +        /*
     4500 +         * Pass off the heavy lifting to spa_load().  Pass TRUE for mosconfig
     4501 +         * because the user-supplied config is actually the one to trust when
     4502 +         * doing an import.
     4503 +         */
     4504 +        if (state != SPA_LOAD_RECOVER)
4760 4505                  spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
4761      -                zfs_dbgmsg("spa_import: importing %s", pool);
4762      -        } else {
4763      -                zfs_dbgmsg("spa_import: importing %s, max_txg=%lld "
4764      -                    "(RECOVERY MODE)", pool, (longlong_t)policy.zrp_txg);
4765      -        }
4766      -        error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request);
4767 4506  
     4507 +        error = spa_load_best(spa, state, B_TRUE, policy.zrp_txg,
     4508 +            policy.zrp_request);
     4509 +
4768 4510          /*
4769 4511           * Propagate anything learned while loading the pool and pass it
4770 4512           * back to caller (i.e. rewind info, missing devices, etc).
4771 4513           */
4772 4514          VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4773 4515              spa->spa_load_info) == 0);
4774 4516  
4775 4517          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4776 4518          /*
4777 4519           * Toss any existing sparelist, as it doesn't have any validity
(25 lines elided)
4803 4545  
4804 4546          if (error != 0 || (props && spa_writeable(spa) &&
4805 4547              (error = spa_prop_set(spa, props)))) {
4806 4548                  spa_unload(spa);
4807 4549                  spa_deactivate(spa);
4808 4550                  spa_remove(spa);
4809 4551                  mutex_exit(&spa_namespace_lock);
4810 4552                  return (error);
4811 4553          }
4812 4554  
4813      -        spa_async_resume(spa);
4814      -
4815 4555          /*
4816 4556           * Override any spares and level 2 cache devices as specified by
4817 4557           * the user, as these may have correct device names/devids, etc.
4818 4558           */
4819 4559          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4820 4560              &spares, &nspares) == 0) {
4821 4561                  if (spa->spa_spares.sav_config)
4822 4562                          VERIFY(nvlist_remove(spa->spa_spares.sav_config,
4823 4563                              ZPOOL_CONFIG_SPARES, DATA_TYPE_NVLIST_ARRAY) == 0);
4824 4564                  else
(15 lines elided)
4840 4580                          VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4841 4581                              NV_UNIQUE_NAME, KM_SLEEP) == 0);
4842 4582                  VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4843 4583                      ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4844 4584                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4845 4585                  spa_load_l2cache(spa);
4846 4586                  spa_config_exit(spa, SCL_ALL, FTAG);
4847 4587                  spa->spa_l2cache.sav_sync = B_TRUE;
4848 4588          }
4849 4589  
     4590 +        /* At this point, we can load spare props */
     4591 +        (void) spa_load_vdev_props(spa);
     4592 +
4850 4593          /*
4851 4594           * Check for any removed devices.
4852 4595           */
4853 4596          if (spa->spa_autoreplace) {
4854 4597                  spa_aux_check_removed(&spa->spa_spares);
4855 4598                  spa_aux_check_removed(&spa->spa_l2cache);
4856 4599          }
4857 4600  
4858 4601          if (spa_writeable(spa)) {
4859 4602                  /*
4860 4603                   * Update the config cache to include the newly-imported pool.
4861 4604                   */
4862 4605                  spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
4863 4606          }
4864 4607  
4865 4608          /*
     4609 +         * Resume async tasks as late as possible to reduce I/O
     4610 +         * activity while importing a pool.  This lets any pending
     4611 +         * txgs (e.g. from a scrub or resilver) complete quickly,
     4612 +         * thereby reducing import times in such cases.
     4613 +         */
     4614 +        spa_async_resume(spa);
     4615 +
     4616 +        /*
4866 4617           * It's possible that the pool was expanded while it was exported.
4867 4618           * We kick off an async task to handle this for us.
4868 4619           */
4869 4620          spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
4870 4621  
     4622 +        /* Set/activate meta feature as needed */
     4623 +        if (!spa_writeable(spa))
     4624 +                spa_check_special_feature(spa);
4871 4625          spa_history_log_version(spa, "import");
4872 4626  
4873 4627          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4874 4628  
4875 4629          mutex_exit(&spa_namespace_lock);
4876 4630  
4877      -        return (0);
     4631 +        if (!spa_writeable(spa))
     4632 +                return (0);
     4633 +
     4634 +        wbc_activate(spa, B_FALSE);
     4635 +
     4636 +        return (dsl_sync_task(spa->spa_name, NULL, spa_special_feature_activate,
     4637 +            spa, 3, ZFS_SPACE_CHECK_RESERVED));
4878 4638  }
4879 4639  
4880 4640  nvlist_t *
4881 4641  spa_tryimport(nvlist_t *tryconfig)
4882 4642  {
4883 4643          nvlist_t *config = NULL;
4884      -        char *poolname, *cachefile;
     4644 +        char *poolname;
4885 4645          spa_t *spa;
4886 4646          uint64_t state;
4887 4647          int error;
4888      -        zpool_rewind_policy_t policy;
4889 4648  
4890 4649          if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
4891 4650                  return (NULL);
4892 4651  
4893 4652          if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
4894 4653                  return (NULL);
4895 4654  
4896 4655          /*
4897 4656           * Create and initialize the spa structure.
4898 4657           */
4899 4658          mutex_enter(&spa_namespace_lock);
4900 4659          spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
4901 4660          spa_activate(spa, FREAD);
4902 4661  
4903 4662          /*
4904      -         * Rewind pool if a max txg was provided. Note that even though we
4905      -         * retrieve the complete rewind policy, only the rewind txg is relevant
4906      -         * for tryimport.
     4663 +         * Pass off the heavy lifting to spa_load().
     4664 +         * Pass TRUE for mosconfig because the user-supplied config
     4665 +         * is actually the one to trust when doing an import.
4907 4666           */
4908      -        zpool_get_rewind_policy(spa->spa_config, &policy);
4909      -        if (policy.zrp_txg != UINT64_MAX) {
4910      -                spa->spa_load_max_txg = policy.zrp_txg;
4911      -                spa->spa_extreme_rewind = B_TRUE;
4912      -                zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld",
4913      -                    poolname, (longlong_t)policy.zrp_txg);
4914      -        } else {
4915      -                zfs_dbgmsg("spa_tryimport: importing %s", poolname);
4916      -        }
     4667 +        error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING, B_TRUE);
4917 4668  
4918      -        if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile)
4919      -            == 0) {
4920      -                zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile);
4921      -                spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
4922      -        } else {
4923      -                spa->spa_config_source = SPA_CONFIG_SRC_SCAN;
4924      -        }
4925      -
4926      -        error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING);
4927      -
4928 4669          /*
4929 4670           * If 'tryconfig' was at least parsable, return the current config.
4930 4671           */
4931 4672          if (spa->spa_root_vdev != NULL) {
4932 4673                  config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
4933 4674                  VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
4934 4675                      poolname) == 0);
4935 4676                  VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
4936 4677                      state) == 0);
4937 4678                  VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
(54 lines elided)
4992 4733   * Pool export/destroy
4993 4734   *
4994 4735   * The act of destroying or exporting a pool is very simple.  We make sure there
4995 4736   * is no more pending I/O and any references to the pool are gone.  Then, we
4996 4737   * update the pool state and sync all the labels to disk, removing the
4997 4738   * configuration from the cache afterwards. If the 'hardforce' flag is set, then
4998 4739   * we don't sync the labels or remove the configuration cache.
4999 4740   */
5000 4741  static int
5001 4742  spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
5002      -    boolean_t force, boolean_t hardforce)
     4743 +    boolean_t force, boolean_t hardforce, boolean_t saveconfig)
5003 4744  {
5004 4745          spa_t *spa;
     4746 +        zfs_autosnap_t *autosnap;
     4747 +        boolean_t wbcthr_stopped = B_FALSE;
5005 4748  
5006 4749          if (oldconfig)
5007 4750                  *oldconfig = NULL;
5008 4751  
5009 4752          if (!(spa_mode_global & FWRITE))
5010 4753                  return (SET_ERROR(EROFS));
5011 4754  
5012 4755          mutex_enter(&spa_namespace_lock);
5013 4756          if ((spa = spa_lookup(pool)) == NULL) {
5014 4757                  mutex_exit(&spa_namespace_lock);
5015 4758                  return (SET_ERROR(ENOENT));
5016 4759          }
5017 4760  
5018 4761          /*
5019      -         * Put a hold on the pool, drop the namespace lock, stop async tasks,
5020      -         * reacquire the namespace lock, and see if we can export.
     4762 +         * Put a hold on the pool, drop the namespace lock, stop async tasks
     4763 +         * and write cache thread, reacquire the namespace lock, and see
     4764 +         * if we can export.
5021 4765           */
5022 4766          spa_open_ref(spa, FTAG);
5023 4767          mutex_exit(&spa_namespace_lock);
     4768 +
     4769 +        autosnap = spa_get_autosnap(spa);
     4770 +        mutex_enter(&autosnap->autosnap_lock);
     4771 +
     4772 +        if (autosnap_has_children_zone(autosnap,
     4773 +            spa_name(spa), B_TRUE)) {
     4774 +                mutex_exit(&autosnap->autosnap_lock);
     4775 +                spa_close(spa, FTAG);
     4776 +                return (SET_ERROR(EBUSY));
     4777 +        }
     4778 +
     4779 +        mutex_exit(&autosnap->autosnap_lock);
     4780 +
     4781 +        wbcthr_stopped = wbc_stop_thread(spa); /* stop write cache thread */
     4782 +        autosnap_destroyer_thread_stop(spa);
5024 4783          spa_async_suspend(spa);
5025 4784          mutex_enter(&spa_namespace_lock);
5026 4785          spa_close(spa, FTAG);
5027 4786  
5028 4787          /*
5029 4788           * The pool will be in core if it's openable,
5030 4789           * in which case we can modify its state.
5031 4790           */
5032 4791          if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) {
5033 4792                  /*
(6 lines elided)
5040 4799                  /*
5041 4800                   * A pool cannot be exported or destroyed if there are active
5042 4801                   * references.  If we are resetting a pool, allow references by
5043 4802                   * fault injection handlers.
5044 4803                   */
5045 4804                  if (!spa_refcount_zero(spa) ||
5046 4805                      (spa->spa_inject_ref != 0 &&
5047 4806                      new_state != POOL_STATE_UNINITIALIZED)) {
5048 4807                          spa_async_resume(spa);
5049 4808                          mutex_exit(&spa_namespace_lock);
     4809 +                        if (wbcthr_stopped)
     4810 +                                (void) wbc_start_thread(spa);
     4811 +                        autosnap_destroyer_thread_start(spa);
5050 4812                          return (SET_ERROR(EBUSY));
5051 4813                  }
5052 4814  
5053 4815                  /*
5054 4816                   * A pool cannot be exported if it has an active shared spare.
5055 4817                   * This is to prevent other pools stealing the active spare
5056 4818                   * from an exported pool. At the user's explicit request,
5057 4819                   * such a pool can still be forcibly exported.
5058 4820                   */
5059 4821                  if (!force && new_state == POOL_STATE_EXPORTED &&
5060 4822                      spa_has_active_shared_spare(spa)) {
5061 4823                          spa_async_resume(spa);
5062 4824                          mutex_exit(&spa_namespace_lock);
     4825 +                        if (wbcthr_stopped)
     4826 +                                (void) wbc_start_thread(spa);
     4827 +                        autosnap_destroyer_thread_start(spa);
5063 4828                          return (SET_ERROR(EXDEV));
5064 4829                  }
5065 4830  
5066 4831                  /*
5067 4832                   * We want this to be reflected on every label,
5068 4833                   * so mark them all dirty.  spa_unload() will do the
5069 4834                   * final sync that pushes these changes out.
5070 4835                   */
5071 4836                  if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
5072 4837                          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
(1 line elided)
5074 4839                          spa->spa_final_txg = spa_last_synced_txg(spa) +
5075 4840                              TXG_DEFER_SIZE + 1;
5076 4841                          vdev_config_dirty(spa->spa_root_vdev);
5077 4842                          spa_config_exit(spa, SCL_ALL, FTAG);
5078 4843                  }
5079 4844          }
5080 4845  
5081 4846          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
5082 4847  
5083 4848          if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
     4849 +                wbc_deactivate(spa);
     4850 +
5084 4851                  spa_unload(spa);
5085 4852                  spa_deactivate(spa);
5086 4853          }
5087 4854  
5088 4855          if (oldconfig && spa->spa_config)
5089 4856                  VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0);
5090 4857  
5091 4858          if (new_state != POOL_STATE_UNINITIALIZED) {
5092 4859                  if (!hardforce)
5093      -                        spa_write_cachefile(spa, B_TRUE, B_TRUE);
     4860 +                        spa_config_sync(spa, !saveconfig, B_TRUE);
     4861 +
5094 4862                  spa_remove(spa);
5095 4863          }
5096 4864          mutex_exit(&spa_namespace_lock);
5097 4865  
5098 4866          return (0);
5099 4867  }
5100 4868  
5101 4869  /*
5102 4870   * Destroy a storage pool.
5103 4871   */
5104 4872  int
5105 4873  spa_destroy(char *pool)
5106 4874  {
5107 4875          return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
5108      -            B_FALSE, B_FALSE));
     4876 +            B_FALSE, B_FALSE, B_FALSE));
5109 4877  }
5110 4878  
5111 4879  /*
5112 4880   * Export a storage pool.
5113 4881   */
5114 4882  int
5115 4883  spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
5116      -    boolean_t hardforce)
     4884 +    boolean_t hardforce, boolean_t saveconfig)
5117 4885  {
5118 4886          return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
5119      -            force, hardforce));
     4887 +            force, hardforce, saveconfig));
5120 4888  }
5121 4889  
5122 4890  /*
5123 4891   * Similar to spa_export(), this unloads the spa_t without actually removing it
5124 4892   * from the namespace in any way.
5125 4893   */
5126 4894  int
5127 4895  spa_reset(char *pool)
5128 4896  {
5129 4897          return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
5130      -            B_FALSE, B_FALSE));
     4898 +            B_FALSE, B_FALSE, B_FALSE));
5131 4899  }
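A short usage sketch of the widened spa_export() (pool name "tank" is
hypothetical).  With saveconfig = B_TRUE, spa_export_common() calls
spa_config_sync(spa, !saveconfig, B_TRUE), so the pool's entry is synced to,
rather than removed from, the cachefile:

        nvlist_t *oldconfig = NULL;
        int err;

        err = spa_export("tank", &oldconfig, B_FALSE /* force */,
            B_FALSE /* hardforce */, B_TRUE /* saveconfig */);
        if (err == 0 && oldconfig != NULL)
                nvlist_free(oldconfig);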
5132 4900  
5133 4901  /*
5134 4902   * ==========================================================================
5135 4903   * Device manipulation
5136 4904   * ==========================================================================
5137 4905   */
5138 4906  
5139 4907  /*
5140 4908   * Add a device to a storage pool.
5141 4909   */
5142 4910  int
5143 4911  spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
5144 4912  {
5145 4913          uint64_t txg, id;
5146 4914          int error;
5147 4915          vdev_t *rvd = spa->spa_root_vdev;
5148 4916          vdev_t *vd, *tvd;
5149 4917          nvlist_t **spares, **l2cache;
5150 4918          uint_t nspares, nl2cache;
     4919 +        dmu_tx_t *tx = NULL;
5151 4920  
5152 4921          ASSERT(spa_writeable(spa));
5153 4922  
5154 4923          txg = spa_vdev_enter(spa);
5155 4924  
5156 4925          if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
5157 4926              VDEV_ALLOC_ADD)) != 0)
5158 4927                  return (spa_vdev_exit(spa, NULL, txg, error));
5159 4928  
5160 4929          spa->spa_pending_vdev = vd;     /* spa_vdev_exit() will clear this */
(14 lines elided)
5175 4944                  return (spa_vdev_exit(spa, vd, txg, error));
5176 4945  
5177 4946          /*
5178 4947           * We must validate the spares and l2cache devices after checking the
5179 4948           * children.  Otherwise, vdev_inuse() will blindly overwrite the spare.
5180 4949           */
5181 4950          if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
5182 4951                  return (spa_vdev_exit(spa, vd, txg, error));
5183 4952  
5184 4953          /*
5185      -         * If we are in the middle of a device removal, we can only add
5186      -         * devices which match the existing devices in the pool.
5187      -         * If we are in the middle of a removal, or have some indirect
5188      -         * vdevs, we can not add raidz toplevels.
     4954 +         * Transfer each new top-level vdev from vd to rvd.
5189 4955           */
5190      -        if (spa->spa_vdev_removal != NULL ||
5191      -            spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5192      -                for (int c = 0; c < vd->vdev_children; c++) {
5193      -                        tvd = vd->vdev_child[c];
5194      -                        if (spa->spa_vdev_removal != NULL &&
5195      -                            tvd->vdev_ashift !=
5196      -                            spa->spa_vdev_removal->svr_vdev->vdev_ashift) {
5197      -                                return (spa_vdev_exit(spa, vd, txg, EINVAL));
5198      -                        }
5199      -                        /* Fail if top level vdev is raidz */
5200      -                        if (tvd->vdev_ops == &vdev_raidz_ops) {
5201      -                                return (spa_vdev_exit(spa, vd, txg, EINVAL));
5202      -                        }
5203      -                        /*
5204      -                         * Need the top level mirror to be
5205      -                         * a mirror of leaf vdevs only
5206      -                         */
5207      -                        if (tvd->vdev_ops == &vdev_mirror_ops) {
5208      -                                for (uint64_t cid = 0;
5209      -                                    cid < tvd->vdev_children; cid++) {
5210      -                                        vdev_t *cvd = tvd->vdev_child[cid];
5211      -                                        if (!cvd->vdev_ops->vdev_op_leaf) {
5212      -                                                return (spa_vdev_exit(spa, vd,
5213      -                                                    txg, EINVAL));
5214      -                                        }
5215      -                                }
5216      -                        }
5217      -                }
5218      -        }
5219      -
5220 4956          for (int c = 0; c < vd->vdev_children; c++) {
5221 4957  
5222 4958                  /*
5223 4959                   * Set the vdev id to the first hole, if one exists.
5224 4960                   */
5225 4961                  for (id = 0; id < rvd->vdev_children; id++) {
5226 4962                          if (rvd->vdev_child[id]->vdev_ishole) {
5227 4963                                  vdev_free(rvd->vdev_child[id]);
5228 4964                                  break;
5229 4965                          }
(32 lines elided)
5262 4998           * if we lose power at any point in this sequence, the remaining
5263 4999           * steps will be completed the next time we load the pool.
5264 5000           */
5265 5001          (void) spa_vdev_exit(spa, vd, txg, 0);
5266 5002  
5267 5003          mutex_enter(&spa_namespace_lock);
5268 5004          spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
5269 5005          spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
5270 5006          mutex_exit(&spa_namespace_lock);
5271 5007  
     5008 +        /*
     5009 +         * "spa_last_synced_txg(spa) + 1" is used because:
     5010 +         *   - spa_vdev_exit() calls txg_wait_synced() for "txg"
     5011 +         *   - spa_config_update() calls txg_wait_synced() for
     5012 +         *     "spa_last_synced_txg(spa) + 1"
     5013 +         */
     5014 +        tx = dmu_tx_create_assigned(spa_get_dsl(spa),
     5015 +            spa_last_synced_txg(spa) + 1);
     5016 +        spa_special_feature_activate(spa, tx);
     5017 +        dmu_tx_commit(tx);
     5018 +
     5019 +        wbc_activate(spa, B_FALSE);
     5020 +
5272 5021          return (0);
5273 5022  }
5274 5023  
5275 5024  /*
5276 5025   * Attach a device to a mirror.  The arguments are the path to any device
5277 5026   * in the mirror, and the nvroot for the new device.  If the path specifies
5278 5027   * a device that is not mirrored, we automatically insert the mirror vdev.
5279 5028   *
5280 5029   * If 'replacing' is specified, the new device is intended to replace the
5281 5030   * existing device; in this case the two devices are made into their own
(13 lines elided)
5295 5044          char *oldvdpath, *newvdpath;
5296 5045          int newvd_isspare;
5297 5046          int error;
5298 5047  
5299 5048          ASSERT(spa_writeable(spa));
5300 5049  
5301 5050          txg = spa_vdev_enter(spa);
5302 5051  
5303 5052          oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5304 5053  
5305      -        if (spa->spa_vdev_removal != NULL ||
5306      -            spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5307      -                return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5308      -        }
5309      -
5310 5054          if (oldvd == NULL)
5311 5055                  return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5312 5056  
5313 5057          if (!oldvd->vdev_ops->vdev_op_leaf)
5314 5058                  return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5315 5059  
5316 5060          pvd = oldvd->vdev_parent;
5317 5061  
5318 5062          if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
5319 5063              VDEV_ALLOC_ATTACH)) != 0)
(145 lines elided)
5465 5209           * respective datasets.
5466 5210           */
5467 5211          dsl_resilver_restart(spa->spa_dsl_pool, dtl_max_txg);
5468 5212  
5469 5213          if (spa->spa_bootfs)
5470 5214                  spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
5471 5215  
5472 5216          spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
5473 5217  
5474 5218          /*
     5219 +         * Check CoS property of the old vdev, add reference by new vdev
     5220 +         */
     5221 +        if (oldvd->vdev_queue.vq_cos) {
     5222 +                cos_hold(oldvd->vdev_queue.vq_cos);
     5223 +                newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos;
     5224 +        }
     5225 +
     5226 +        /*
5475 5227           * Commit the config
5476 5228           */
5477 5229          (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
5478 5230  
5479 5231          spa_history_log_internal(spa, "vdev attach", NULL,
5480 5232              "%s vdev=%s %s vdev=%s",
5481 5233              replacing && newvd_isspare ? "spare in" :
5482 5234              replacing ? "replace" : "attach", newvdpath,
5483 5235              replacing ? "for" : "to", oldvdpath);
5484 5236  
(193 lines elided)
5678 5430           * prevent vd from being accessed after it's freed.
5679 5431           */
5680 5432          vdpath = spa_strdup(vd->vdev_path);
5681 5433          for (int t = 0; t < TXG_SIZE; t++)
5682 5434                  (void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
5683 5435          vd->vdev_detached = B_TRUE;
5684 5436          vdev_dirty(tvd, VDD_DTL, vd, txg);
5685 5437  
5686 5438          spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
5687 5439  
     5440 +        /*
     5441 +         * Release the references to CoS descriptors if any
     5442 +         */
     5443 +        if (vd->vdev_queue.vq_cos) {
     5444 +                cos_rele(vd->vdev_queue.vq_cos);
     5445 +                vd->vdev_queue.vq_cos = NULL;
     5446 +        }
     5447 +
5688 5448          /* hang on to the spa before we release the lock */
5689 5449          spa_open_ref(spa, FTAG);
5690 5450  
5691 5451          error = spa_vdev_exit(spa, vd, txg, 0);
5692 5452  
5693 5453          spa_history_log_internal(spa, "detach", NULL,
5694 5454              "vdev=%s", vdpath);
5695 5455          spa_strfree(vdpath);
5696 5456  
5697 5457          /*
(42 lines elided)
5740 5500          spa_t *newspa;
5741 5501          uint_t c, children, lastlog;
5742 5502          nvlist_t **child, *nvl, *tmp;
5743 5503          dmu_tx_t *tx;
5744 5504          char *altroot = NULL;
5745 5505          vdev_t *rvd, **vml = NULL;                      /* vdev modify list */
5746 5506          boolean_t activate_slog;
5747 5507  
5748 5508          ASSERT(spa_writeable(spa));
5749 5509  
     5510 +        /*
     5511 +         * Splitting a pool with an active WBC is not yet
     5512 +         * supported; it will be implemented in a future release.
     5513 +         */
     5514 +        if (spa_feature_is_active(spa, SPA_FEATURE_WBC))
     5515 +                return (SET_ERROR(ENOTSUP));
     5516 +
5750 5517          txg = spa_vdev_enter(spa);
5751 5518  
5752 5519          /* clear the log and flush everything up to now */
5753 5520          activate_slog = spa_passivate_log(spa);
5754 5521          (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5755      -        error = spa_reset_logs(spa);
     5522 +        error = spa_offline_log(spa);
5756 5523          txg = spa_vdev_config_enter(spa);
5757 5524  
5758 5525          if (activate_slog)
5759 5526                  spa_activate_log(spa);
5760 5527  
5761 5528          if (error != 0)
5762 5529                  return (spa_vdev_exit(spa, NULL, txg, error));
5763 5530  
5764 5531          /* check new spa name before going any further */
5765 5532          if (spa_lookup(newname) != NULL)
(7 lines elided)
5773 5540              &children) != 0)
5774 5541                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5775 5542  
5776 5543          /* first, check to ensure we've got the right child count */
5777 5544          rvd = spa->spa_root_vdev;
5778 5545          lastlog = 0;
5779 5546          for (c = 0; c < rvd->vdev_children; c++) {
5780 5547                  vdev_t *vd = rvd->vdev_child[c];
5781 5548  
5782 5549                  /* don't count the holes & logs as children */
5783      -                if (vd->vdev_islog || !vdev_is_concrete(vd)) {
     5550 +                if (vd->vdev_islog || vd->vdev_ishole) {
5784 5551                          if (lastlog == 0)
5785 5552                                  lastlog = c;
5786 5553                          continue;
5787 5554                  }
5788 5555  
5789 5556                  lastlog = 0;
5790 5557          }
5791 5558          if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
5792 5559                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5793 5560  
(32 lines elided)
5826 5593                  /* look it up in the spa */
5827 5594                  vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
5828 5595                  if (vml[c] == NULL) {
5829 5596                          error = SET_ERROR(ENODEV);
5830 5597                          break;
5831 5598                  }
5832 5599  
5833 5600                  /* make sure there's nothing stopping the split */
5834 5601                  if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
5835 5602                      vml[c]->vdev_islog ||
5836      -                    !vdev_is_concrete(vml[c]) ||
     5603 +                    vml[c]->vdev_ishole ||
5837 5604                      vml[c]->vdev_isspare ||
5838 5605                      vml[c]->vdev_isl2cache ||
5839 5606                      !vdev_writeable(vml[c]) ||
5840 5607                      vml[c]->vdev_children != 0 ||
5841 5608                      vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
5842 5609                      c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
5843 5610                          error = SET_ERROR(EINVAL);
5844 5611                          break;
5845 5612                  }
5846 5613  
(74 lines elided)
5921 5688  
5922 5689          /* release the spa config lock, retaining the namespace lock */
5923 5690          spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5924 5691  
5925 5692          if (zio_injection_enabled)
5926 5693                  zio_handle_panic_injection(spa, FTAG, 1);
5927 5694  
5928 5695          spa_activate(newspa, spa_mode_global);
5929 5696          spa_async_suspend(newspa);
5930 5697  
5931      -        newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT;
5932      -
5933 5698          /* create the new pool from the disks of the original pool */
5934      -        error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE);
     5699 +        error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE, B_TRUE);
5935 5700          if (error)
5936 5701                  goto out;
5937 5702  
5938 5703          /* if that worked, generate a real config for the new pool */
5939 5704          if (newspa->spa_root_vdev != NULL) {
5940 5705                  VERIFY(nvlist_alloc(&newspa->spa_config_splitting,
5941 5706                      NV_UNIQUE_NAME, KM_SLEEP) == 0);
5942 5707                  VERIFY(nvlist_add_uint64(newspa->spa_config_splitting,
5943 5708                      ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa)) == 0);
5944 5709                  spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
(19 lines elided)
5964 5729          spa_async_resume(newspa);
5965 5730  
5966 5731          /* finally, update the original pool's config */
5967 5732          txg = spa_vdev_config_enter(spa);
5968 5733          tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
5969 5734          error = dmu_tx_assign(tx, TXG_WAIT);
5970 5735          if (error != 0)
5971 5736                  dmu_tx_abort(tx);
5972 5737          for (c = 0; c < children; c++) {
5973 5738                  if (vml[c] != NULL) {
     5739 +                        vdev_t *tvd = vml[c]->vdev_top;
     5740 +
     5741 +                        /*
     5742 +                         * Need to be sure the detachable VDEV is not
     5743 +                         * on any *other* txg's DTL list to prevent it
     5744 +                         * from being accessed after it's freed.
     5745 +                         */
     5746 +                        for (int t = 0; t < TXG_SIZE; t++) {
     5747 +                                (void) txg_list_remove_this(
     5748 +                                    &tvd->vdev_dtl_list, vml[c], t);
     5749 +                        }
     5750 +
5974 5751                          vdev_split(vml[c]);
5975 5752                          if (error == 0)
5976 5753                                  spa_history_log_internal(spa, "detach", tx,
5977 5754                                      "vdev=%s", vml[c]->vdev_path);
5978 5755  
5979 5756                          vdev_free(vml[c]);
5980 5757                  }
5981 5758          }
5982 5759          spa->spa_avz_action = AVZ_ACTION_REBUILD;
5983 5760          vdev_config_dirty(spa->spa_root_vdev);
[8 lines elided]
5992 5769  
5993 5770          /* split is complete; log a history record */
5994 5771          spa_history_log_internal(newspa, "split", NULL,
5995 5772              "from pool %s", spa_name(spa));
5996 5773  
5997 5774          kmem_free(vml, children * sizeof (vdev_t *));
5998 5775  
5999 5776          /* if we're not going to mount the filesystems in userland, export */
6000 5777          if (exp)
6001 5778                  error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
6002      -                    B_FALSE, B_FALSE);
     5779 +                    B_FALSE, B_FALSE, B_FALSE);
6003 5780  
6004 5781          return (error);
6005 5782  
6006 5783  out:
6007 5784          spa_unload(newspa);
6008 5785          spa_deactivate(newspa);
6009 5786          spa_remove(newspa);
6010 5787  
6011 5788          txg = spa_vdev_config_enter(spa);
6012 5789  
[5 lines elided]
6018 5795          vdev_reopen(spa->spa_root_vdev);
6019 5796  
6020 5797          nvlist_free(spa->spa_config_splitting);
6021 5798          spa->spa_config_splitting = NULL;
6022 5799          (void) spa_vdev_exit(spa, NULL, txg, error);
6023 5800  
6024 5801          kmem_free(vml, children * sizeof (vdev_t *));
6025 5802          return (error);
6026 5803  }
6027 5804  
     5805 +static nvlist_t *
     5806 +spa_nvlist_lookup_by_guid(nvlist_t **nvpp, int count, uint64_t target_guid)
     5807 +{
     5808 +        for (int i = 0; i < count; i++) {
     5809 +                uint64_t guid;
     5810 +
     5811 +                VERIFY(nvlist_lookup_uint64(nvpp[i], ZPOOL_CONFIG_GUID,
     5812 +                    &guid) == 0);
     5813 +
     5814 +                if (guid == target_guid)
     5815 +                        return (nvpp[i]);
     5816 +        }
     5817 +
     5818 +        return (NULL);
     5819 +}
     5820 +
     5821 +static void
     5822 +spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count,
     5823 +    nvlist_t *dev_to_remove)
     5824 +{
     5825 +        nvlist_t **newdev = NULL;
     5826 +
     5827 +        if (count > 1)
     5828 +                newdev = kmem_alloc((count - 1) * sizeof (void *), KM_SLEEP);
     5829 +
     5830 +        for (int i = 0, j = 0; i < count; i++) {
     5831 +                if (dev[i] == dev_to_remove)
     5832 +                        continue;
     5833 +                VERIFY(nvlist_dup(dev[i], &newdev[j++], KM_SLEEP) == 0);
     5834 +        }
     5835 +
     5836 +        VERIFY(nvlist_remove(config, name, DATA_TYPE_NVLIST_ARRAY) == 0);
     5837 +        VERIFY(nvlist_add_nvlist_array(config, name, newdev, count - 1) == 0);
     5838 +
     5839 +        for (int i = 0; i < count - 1; i++)
     5840 +                nvlist_free(newdev[i]);
     5841 +
     5842 +        if (count > 1)
     5843 +                kmem_free(newdev, (count - 1) * sizeof (void *));
     5844 +}
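
A minimal usage sketch (all names as used later in spa_vdev_remove() below):
the helper duplicates every entry except the target and rewrites the named
array in place.  Note the count == 1 edge case: the loop body never runs,
newdev stays NULL, and the array is re-added with zero elements rather than
having its key deleted.

	/*
	 * Illustrative call for the spare-removal path: 'nv' is the
	 * entry previously found by spa_nvlist_lookup_by_guid().
	 */
	spa_vdev_remove_aux(spa->spa_spares.sav_config,
	    ZPOOL_CONFIG_SPARES, spares, nspares, nv);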
     5845 +
6028 5846  /*
     5847 + * Evacuate the device.
     5848 + */
     5849 +static int
     5850 +spa_vdev_remove_evacuate(spa_t *spa, vdev_t *vd)
     5851 +{
     5852 +        uint64_t txg;
     5853 +        int error = 0;
     5854 +
     5855 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
     5856 +        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
     5857 +        ASSERT(vd == vd->vdev_top);
     5858 +
     5859 +        /*
     5860 +         * Evacuate the device.  We don't hold the config lock as writer
     5861 +         * since we need to do I/O but we do keep the
     5862 +         * spa_namespace_lock held.  Once this completes the device
     5863 +         * should no longer have any blocks allocated on it.
     5864 +         */
     5865 +        if (vd->vdev_islog) {
     5866 +                if (vd->vdev_stat.vs_alloc != 0)
     5867 +                        error = spa_offline_log(spa);
     5868 +        } else {
     5869 +                error = SET_ERROR(ENOTSUP);
     5870 +        }
     5871 +
     5872 +        if (error)
     5873 +                return (error);
     5874 +
     5875 +        /*
     5876 +         * The evacuation succeeded.  Remove any remaining MOS metadata
     5877 +         * associated with this vdev, and wait for these changes to sync.
     5878 +         */
     5879 +        ASSERT0(vd->vdev_stat.vs_alloc);
     5880 +        txg = spa_vdev_config_enter(spa);
     5881 +        vd->vdev_removing = B_TRUE;
     5882 +        vdev_dirty_leaves(vd, VDD_DTL, txg);
     5883 +        vdev_config_dirty(vd);
     5884 +        spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
     5885 +
     5886 +        return (0);
     5887 +}
     5888 +
     5889 +/*
     5890 + * Complete the removal by cleaning up the namespace.
     5891 + */
     5892 +static void
     5893 +spa_vdev_remove_from_namespace(spa_t *spa, vdev_t *vd)
     5894 +{
     5895 +        vdev_t *rvd = spa->spa_root_vdev;
     5896 +        uint64_t id = vd->vdev_id;
     5897 +        boolean_t last_vdev = (id == (rvd->vdev_children - 1));
     5898 +
     5899 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
     5900 +        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
     5901 +        ASSERT(vd == vd->vdev_top);
     5902 +
     5903 +        /*
     5904 +         * Only remove any devices which are empty.
     5905 +         */
     5906 +        if (vd->vdev_stat.vs_alloc != 0)
     5907 +                return;
     5908 +
     5909 +        (void) vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
     5910 +
     5911 +        if (list_link_active(&vd->vdev_state_dirty_node))
     5912 +                vdev_state_clean(vd);
     5913 +        if (list_link_active(&vd->vdev_config_dirty_node))
     5914 +                vdev_config_clean(vd);
     5915 +
     5916 +        vdev_free(vd);
     5917 +
     5918 +        if (last_vdev) {
     5919 +                vdev_compact_children(rvd);
     5920 +        } else {
     5921 +                vd = vdev_alloc_common(spa, id, 0, &vdev_hole_ops);
     5922 +                vdev_add_child(rvd, vd);
     5923 +        }
     5924 +        vdev_config_dirty(rvd);
     5925 +
     5926 +        /*
     5927 +         * Reassess the health of our root vdev.
     5928 +         */
     5929 +        vdev_reopen(rvd);
     5930 +}
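
Substituting a hole vdev instead of compacting the child array keeps every
remaining top-level child's vdev_id equal to its slot index; the mirror-split
check earlier in this file (c != spa->spa_root_vdev->vdev_child[c]->vdev_id)
depends on that invariant.  A hypothetical assertion helper, not part of the
original change, that states it directly:

	static void
	spa_assert_child_ids(vdev_t *rvd)
	{
		for (uint64_t c = 0; c < rvd->vdev_children; c++)
			ASSERT3U(rvd->vdev_child[c]->vdev_id, ==, c);
	}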
     5931 +
     5932 +/*
     5933 + * Remove a device from the pool -
     5934 + *
     5935 + * Removing a device from the vdev namespace requires several steps
     5936 + * and can take a significant amount of time.  As a result we use
     5937 + * the spa_vdev_config_[enter/exit] functions which allow us to
     5938 + * grab and release the spa_config_lock while still holding the namespace
     5939 + * lock.  During each step the configuration is synced out.
     5940 + *
     5941 + * Currently, this supports removing only hot spares, slogs, level 2 ARC
     5942 + * and special devices.
     5943 + */
     5944 +int
     5945 +spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare)
     5946 +{
     5947 +        vdev_t *vd;
     5948 +        sysevent_t *ev = NULL;
     5949 +        metaslab_group_t *mg;
     5950 +        nvlist_t **spares, **l2cache, *nv;
     5951 +        uint64_t txg = 0;
     5952 +        uint_t nspares, nl2cache;
     5953 +        int error = 0;
     5954 +        boolean_t locked = MUTEX_HELD(&spa_namespace_lock);
     5955 +
     5956 +        ASSERT(spa_writeable(spa));
     5957 +
     5958 +        if (!locked)
     5959 +                txg = spa_vdev_enter(spa);
     5960 +
     5961 +        vd = spa_lookup_by_guid(spa, guid, B_FALSE);
     5962 +
     5963 +        if (spa->spa_spares.sav_vdevs != NULL &&
     5964 +            nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
     5965 +            ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0 &&
     5966 +            (nv = spa_nvlist_lookup_by_guid(spares, nspares, guid)) != NULL) {
     5967 +                /*
     5968 +                 * Only remove the hot spare if it's not currently in use
     5969 +                 * in this pool.
     5970 +                 */
     5971 +                if (vd == NULL || unspare) {
     5972 +                        if (vd == NULL)
     5973 +                                vd = spa_lookup_by_guid(spa, guid, B_TRUE);
     5974 +
     5975 +                        /*
     5976 +                         * Release the references to CoS descriptors if any
     5977 +                         */
     5978 +                        if (vd != NULL && vd->vdev_queue.vq_cos) {
     5979 +                                cos_rele(vd->vdev_queue.vq_cos);
     5980 +                                vd->vdev_queue.vq_cos = NULL;
     5981 +                        }
     5982 +
     5983 +                        ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
     5984 +                        spa_vdev_remove_aux(spa->spa_spares.sav_config,
     5985 +                            ZPOOL_CONFIG_SPARES, spares, nspares, nv);
     5986 +                        spa_load_spares(spa);
     5987 +                        spa->spa_spares.sav_sync = B_TRUE;
     5988 +                } else {
     5989 +                        error = SET_ERROR(EBUSY);
     5990 +                }
     5991 +        } else if (spa->spa_l2cache.sav_vdevs != NULL &&
     5992 +            nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
     5993 +            ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0 &&
     5994 +            (nv = spa_nvlist_lookup_by_guid(l2cache, nl2cache, guid)) != NULL) {
     5995 +                /*
     5996 +                 * Cache devices can always be removed.
     5997 +                 */
     5998 +                if (vd == NULL)
     5999 +                        vd = spa_lookup_by_guid(spa, guid, B_TRUE);
     6000 +                /*
     6001 +                 * Release the references to CoS descriptors if any
     6002 +                 */
     6003 +                if (vd != NULL && vd->vdev_queue.vq_cos) {
     6004 +                        cos_rele(vd->vdev_queue.vq_cos);
     6005 +                        vd->vdev_queue.vq_cos = NULL;
     6006 +                }
     6007 +
     6008 +                ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
     6009 +                spa_vdev_remove_aux(spa->spa_l2cache.sav_config,
     6010 +                    ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache, nv);
     6011 +                spa_load_l2cache(spa);
     6012 +                spa->spa_l2cache.sav_sync = B_TRUE;
     6013 +        } else if (vd != NULL && vd->vdev_islog) {
     6014 +                ASSERT(!locked);
     6015 +
     6016 +                if (vd != vd->vdev_top)
     6017 +                        return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
     6018 +
     6019 +                mg = vd->vdev_mg;
     6020 +
     6021 +                /*
     6022 +                 * Stop allocating from this vdev.
     6023 +                 */
     6024 +                metaslab_group_passivate(mg);
     6025 +
     6026 +                /*
     6027 +                 * Wait for the youngest allocations and frees to sync,
     6028 +                 * and then wait for the deferral of those frees to finish.
     6029 +                 */
     6030 +                spa_vdev_config_exit(spa, NULL,
     6031 +                    txg + TXG_CONCURRENT_STATES + TXG_DEFER_SIZE, 0, FTAG);
     6032 +
     6033 +                /*
     6034 +                 * Attempt to evacuate the vdev.
     6035 +                 */
     6036 +                error = spa_vdev_remove_evacuate(spa, vd);
     6037 +
     6038 +                txg = spa_vdev_config_enter(spa);
     6039 +
     6040 +                /*
     6041 +                 * If we couldn't evacuate the vdev, unwind.
     6042 +                 */
     6043 +                if (error) {
     6044 +                        metaslab_group_activate(mg);
     6045 +                        return (spa_vdev_exit(spa, NULL, txg, error));
     6046 +                }
     6047 +
     6048 +                /*
     6049 +                 * Release the references to CoS descriptors if any
     6050 +                 */
     6051 +                if (vd->vdev_queue.vq_cos) {
     6052 +                        cos_rele(vd->vdev_queue.vq_cos);
     6053 +                        vd->vdev_queue.vq_cos = NULL;
     6054 +                }
     6055 +
     6056 +                ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
     6057 +
     6058 +                /*
     6059 +                 * Clean up the vdev namespace.
     6060 +                 */
     6062 +                spa_vdev_remove_from_namespace(spa, vd);
     6063 +
     6064 +        } else if (vd != NULL && vdev_is_special(vd)) {
     6065 +                ASSERT(!locked);
     6066 +
     6067 +                if (vd != vd->vdev_top)
     6068 +                        return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
     6069 +
     6070 +                error = spa_special_vdev_remove(spa, vd, &txg);
     6071 +                if (error == 0) {
     6072 +                        ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
     6073 +                        spa_vdev_remove_from_namespace(spa, vd);
     6074 +
     6075 +                        /*
     6076 +                         * User sees this field as 'enablespecial'
     6077 +                         * pool-level property
     6078 +                         */
     6079 +                        spa->spa_usesc = B_FALSE;
     6080 +                }
     6081 +        } else if (vd != NULL) {
     6082 +                /*
     6083 +                 * Normal vdevs cannot be removed (yet).
     6084 +                 */
     6085 +                error = SET_ERROR(ENOTSUP);
     6086 +        } else {
     6087 +                /*
     6088 +                 * There is no vdev of any kind with the specified guid.
     6089 +                 */
     6090 +                error = SET_ERROR(ENOENT);
     6091 +        }
     6092 +
     6093 +        if (!locked)
     6094 +                error = spa_vdev_exit(spa, NULL, txg, error);
     6095 +
     6096 +        if (ev)
     6097 +                spa_event_notify_impl(ev);
     6098 +
     6099 +        return (error);
     6100 +}
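
spa_vdev_remove() takes and drops the vdev/namespace locks itself when the
caller does not already hold spa_namespace_lock.  A minimal sketch of an
unlocked caller (the wrapper name is hypothetical):

	static int
	spa_remove_spare_example(spa_t *spa, uint64_t guid)
	{
		/*
		 * unspare = B_TRUE forces removal of a hot spare even
		 * if it is currently in use in this pool; with B_FALSE
		 * an in-use spare fails with EBUSY, per the logic above.
		 */
		return (spa_vdev_remove(spa, guid, B_TRUE));
	}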
     6101 +
     6102 +/*
6029 6103   * Find any device that's done replacing, or a vdev marked 'unspare' that's
6030 6104   * currently spared, so we can detach it.
6031 6105   */
6032 6106  static vdev_t *
6033 6107  spa_vdev_resilver_done_hunt(vdev_t *vd)
6034 6108  {
6035 6109          vdev_t *newvd, *oldvd;
6036 6110  
6037 6111          for (int c = 0; c < vd->vdev_children; c++) {
6038 6112                  oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
[16 lines elided]
6055 6129                  oldvd = vd->vdev_child[0];
6056 6130  
6057 6131                  if (vdev_dtl_empty(newvd, DTL_MISSING) &&
6058 6132                      vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6059 6133                      !vdev_dtl_required(oldvd))
6060 6134                          return (oldvd);
6061 6135          }
6062 6136  
6063 6137          /*
6064 6138           * Check for a completed resilver with the 'unspare' flag set.
     6139 +         * Also potentially update faulted state.
6065 6140           */
6066 6141          if (vd->vdev_ops == &vdev_spare_ops) {
6067 6142                  vdev_t *first = vd->vdev_child[0];
6068 6143                  vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
6069 6144  
6070 6145                  if (last->vdev_unspare) {
6071 6146                          oldvd = first;
6072 6147                          newvd = last;
6073 6148                  } else if (first->vdev_unspare) {
6074 6149                          oldvd = last;
[1 line elided]
6076 6151                  } else {
6077 6152                          oldvd = NULL;
6078 6153                  }
6079 6154  
6080 6155                  if (oldvd != NULL &&
6081 6156                      vdev_dtl_empty(newvd, DTL_MISSING) &&
6082 6157                      vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6083 6158                      !vdev_dtl_required(oldvd))
6084 6159                          return (oldvd);
6085 6160  
     6161 +                vdev_propagate_state(vd);
     6162 +
6086 6163                  /*
6087 6164                   * If there are more than two spares attached to a disk,
6088 6165                   * and those spares are not required, then we want to
6089 6166                   * attempt to free them up now so that they can be used
6090 6167                   * by other pools.  Once we're back down to a single
6091 6168                   * disk+spare, we stop removing them.
6092 6169                   */
6093 6170                  if (vd->vdev_children > 2) {
6094 6171                          newvd = vd->vdev_child[1];
6095 6172  
[40 lines elided]
6136 6213                          return;
6137 6214                  if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
6138 6215                          return;
6139 6216                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6140 6217          }
6141 6218  
6142 6219          spa_config_exit(spa, SCL_ALL, FTAG);
6143 6220  }
6144 6221  
6145 6222  /*
6146      - * Update the stored path or FRU for this vdev.
6147      - */
6148      -int
6149      -spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value,
6150      -    boolean_t ispath)
6151      -{
6152      -        vdev_t *vd;
6153      -        boolean_t sync = B_FALSE;
6154      -
6155      -        ASSERT(spa_writeable(spa));
6156      -
6157      -        spa_vdev_state_enter(spa, SCL_ALL);
6158      -
6159      -        if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
6160      -                return (spa_vdev_state_exit(spa, NULL, ENOENT));
6161      -
6162      -        if (!vd->vdev_ops->vdev_op_leaf)
6163      -                return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
6164      -
6165      -        if (ispath) {
6166      -                if (strcmp(value, vd->vdev_path) != 0) {
6167      -                        spa_strfree(vd->vdev_path);
6168      -                        vd->vdev_path = spa_strdup(value);
6169      -                        sync = B_TRUE;
6170      -                }
6171      -        } else {
6172      -                if (vd->vdev_fru == NULL) {
6173      -                        vd->vdev_fru = spa_strdup(value);
6174      -                        sync = B_TRUE;
6175      -                } else if (strcmp(value, vd->vdev_fru) != 0) {
6176      -                        spa_strfree(vd->vdev_fru);
6177      -                        vd->vdev_fru = spa_strdup(value);
6178      -                        sync = B_TRUE;
6179      -                }
6180      -        }
6181      -
6182      -        return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0));
6183      -}
6184      -
6185      -int
6186      -spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath)
6187      -{
6188      -        return (spa_vdev_set_common(spa, guid, newpath, B_TRUE));
6189      -}
6190      -
6191      -int
6192      -spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru)
6193      -{
6194      -        return (spa_vdev_set_common(spa, guid, newfru, B_FALSE));
6195      -}
6196      -
6197      -/*
6198 6223   * ==========================================================================
6199 6224   * SPA Scanning
6200 6225   * ==========================================================================
6201 6226   */
6202 6227  int
6203 6228  spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
6204 6229  {
6205 6230          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6206 6231  
6207 6232          if (dsl_scan_resilvering(spa->spa_dsl_pool))
[176 lines elided]
6384 6409          if (tasks & SPA_ASYNC_RESILVER_DONE)
6385 6410                  spa_vdev_resilver_done(spa);
6386 6411  
6387 6412          /*
6388 6413           * Kick off a resilver.
6389 6414           */
6390 6415          if (tasks & SPA_ASYNC_RESILVER)
6391 6416                  dsl_resilver_restart(spa->spa_dsl_pool, 0);
6392 6417  
6393 6418          /*
     6419 +         * Kick off L2 cache rebuilding.
     6420 +         */
     6421 +        if (tasks & SPA_ASYNC_L2CACHE_REBUILD)
     6422 +                l2arc_spa_rebuild_start(spa);
     6423 +
     6424 +        if (tasks & SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY) {
     6425 +                mutex_enter(&spa->spa_man_trim_lock);
     6426 +                spa_man_trim_taskq_destroy(spa);
     6427 +                mutex_exit(&spa->spa_man_trim_lock);
     6428 +        }
     6429 +
     6430 +        /*
6394 6431           * Let the world know that we're done.
6395 6432           */
6396 6433          mutex_enter(&spa->spa_async_lock);
6397 6434          spa->spa_async_thread = NULL;
6398 6435          cv_broadcast(&spa->spa_async_cv);
6399 6436          mutex_exit(&spa->spa_async_lock);
6400 6437          thread_exit();
6401 6438  }
6402 6439  
6403 6440  void
6404 6441  spa_async_suspend(spa_t *spa)
6405 6442  {
6406 6443          mutex_enter(&spa->spa_async_lock);
6407 6444          spa->spa_async_suspended++;
6408 6445          while (spa->spa_async_thread != NULL)
6409 6446                  cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
6410 6447          mutex_exit(&spa->spa_async_lock);
6411      -
6412      -        spa_vdev_remove_suspend(spa);
6413      -
6414      -        zthr_t *condense_thread = spa->spa_condense_zthr;
6415      -        if (condense_thread != NULL && zthr_isrunning(condense_thread))
6416      -                VERIFY0(zthr_cancel(condense_thread));
6417 6448  }
6418 6449  
6419 6450  void
6420 6451  spa_async_resume(spa_t *spa)
6421 6452  {
6422 6453          mutex_enter(&spa->spa_async_lock);
6423 6454          ASSERT(spa->spa_async_suspended != 0);
6424 6455          spa->spa_async_suspended--;
6425 6456          mutex_exit(&spa->spa_async_lock);
6426      -        spa_restart_removal(spa);
6427      -
6428      -        zthr_t *condense_thread = spa->spa_condense_zthr;
6429      -        if (condense_thread != NULL && !zthr_isrunning(condense_thread))
6430      -                zthr_resume(condense_thread);
6431 6457  }
6432 6458  
6433 6459  static boolean_t
6434 6460  spa_async_tasks_pending(spa_t *spa)
6435 6461  {
6436 6462          uint_t non_config_tasks;
6437 6463          uint_t config_task;
6438 6464          boolean_t config_task_suspended;
6439 6465  
6440 6466          non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
[24 lines elided]
6465 6491  
6466 6492  void
6467 6493  spa_async_request(spa_t *spa, int task)
6468 6494  {
6469 6495          zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
6470 6496          mutex_enter(&spa->spa_async_lock);
6471 6497          spa->spa_async_tasks |= task;
6472 6498          mutex_exit(&spa->spa_async_lock);
6473 6499  }
6474 6500  
     6501 +void
     6502 +spa_async_unrequest(spa_t *spa, int task)
     6503 +{
     6504 +        zfs_dbgmsg("spa=%s async unrequest task=%u", spa->spa_name, task);
     6505 +        mutex_enter(&spa->spa_async_lock);
     6506 +        spa->spa_async_tasks &= ~task;
     6507 +        mutex_exit(&spa->spa_async_lock);
     6508 +}
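
spa_async_unrequest() clears a task bit that was queued but not yet consumed
by the async thread.  A sketch of the intended pairing, assuming (outside this
hunk) that a newly started manual TRIM wants to cancel a still-pending taskq
teardown:

	/* hypothetical caller, e.g. when a new manual TRIM starts */
	spa_async_unrequest(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);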
     6509 +
6475 6510  /*
6476 6511   * ==========================================================================
6477 6512   * SPA syncing routines
6478 6513   * ==========================================================================
6479 6514   */
6480 6515  
6481 6516  static int
6482 6517  bpobj_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6483 6518  {
6484 6519          bpobj_t *bpo = arg;
[268 lines elided]
6753 6788  }
6754 6789  
6755 6790  /*
6756 6791   * Set zpool properties.
6757 6792   */
6758 6793  static void
6759 6794  spa_sync_props(void *arg, dmu_tx_t *tx)
6760 6795  {
6761 6796          nvlist_t *nvp = arg;
6762 6797          spa_t *spa = dmu_tx_pool(tx)->dp_spa;
     6798 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
6763 6799          objset_t *mos = spa->spa_meta_objset;
6764 6800          nvpair_t *elem = NULL;
6765 6801  
6766 6802          mutex_enter(&spa->spa_props_lock);
6767 6803  
6768 6804          while ((elem = nvlist_next_nvpair(nvp, elem))) {
6769 6805                  uint64_t intval;
6770 6806                  char *strval, *fname;
6771 6807                  zpool_prop_t prop;
6772 6808                  const char *propname;
6773 6809                  zprop_type_t proptype;
6774 6810                  spa_feature_t fid;
6775 6811  
6776 6812                  switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
6777      -                case ZPOOL_PROP_INVAL:
     6813 +                case ZPROP_INVAL:
6778 6814                          /*
6779 6815                           * We checked this earlier in spa_prop_validate().
6780 6816                           */
6781 6817                          ASSERT(zpool_prop_feature(nvpair_name(elem)));
6782 6818  
6783 6819                          fname = strchr(nvpair_name(elem), '@') + 1;
6784 6820                          VERIFY0(zfeature_lookup_name(fname, &fid));
6785 6821  
6786 6822                          spa_feature_enable(spa, fid, tx);
6787 6823                          spa_history_log_internal(spa, "set", tx,
[77 lines elided]
6865 6901                                  spa_history_log_internal(spa, "set", tx,
6866 6902                                      "%s=%lld", nvpair_name(elem), intval);
6867 6903                          } else {
6868 6904                                  ASSERT(0); /* not allowed */
6869 6905                          }
6870 6906  
6871 6907                          switch (prop) {
6872 6908                          case ZPOOL_PROP_DELEGATION:
6873 6909                                  spa->spa_delegation = intval;
6874 6910                                  break;
     6911 +                        case ZPOOL_PROP_DDT_DESEGREGATION:
     6912 +                                spa_set_ddt_classes(spa, intval);
     6913 +                                break;
     6914 +                        case ZPOOL_PROP_DEDUP_BEST_EFFORT:
     6915 +                                spa->spa_dedup_best_effort = intval;
     6916 +                                break;
     6917 +                        case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
     6918 +                                spa->spa_dedup_lo_best_effort = intval;
     6919 +                                break;
     6920 +                        case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
     6921 +                                spa->spa_dedup_hi_best_effort = intval;
     6922 +                                break;
6875 6923                          case ZPOOL_PROP_BOOTFS:
6876 6924                                  spa->spa_bootfs = intval;
6877 6925                                  break;
6878 6926                          case ZPOOL_PROP_FAILUREMODE:
6879 6927                                  spa->spa_failmode = intval;
6880 6928                                  break;
     6929 +                        case ZPOOL_PROP_FORCETRIM:
     6930 +                                spa->spa_force_trim = intval;
     6931 +                                break;
     6932 +                        case ZPOOL_PROP_AUTOTRIM:
     6933 +                                mutex_enter(&spa->spa_auto_trim_lock);
     6934 +                                if (intval != spa->spa_auto_trim) {
     6935 +                                        spa->spa_auto_trim = intval;
     6936 +                                        if (intval != 0)
     6937 +                                                spa_auto_trim_taskq_create(spa);
     6938 +                                        else
     6939 +                                                spa_auto_trim_taskq_destroy(
     6940 +                                                    spa);
     6941 +                                }
     6942 +                                mutex_exit(&spa->spa_auto_trim_lock);
     6943 +                                break;
6881 6944                          case ZPOOL_PROP_AUTOEXPAND:
6882 6945                                  spa->spa_autoexpand = intval;
6883 6946                                  if (tx->tx_txg != TXG_INITIAL)
6884 6947                                          spa_async_request(spa,
6885 6948                                              SPA_ASYNC_AUTOEXPAND);
6886 6949                                  break;
6887 6950                          case ZPOOL_PROP_DEDUPDITTO:
6888 6951                                  spa->spa_dedup_ditto = intval;
6889 6952                                  break;
     6953 +                        case ZPOOL_PROP_MINWATERMARK:
     6954 +                                spa->spa_minwat = intval;
     6955 +                                break;
     6956 +                        case ZPOOL_PROP_LOWATERMARK:
     6957 +                                spa->spa_lowat = intval;
     6958 +                                break;
     6959 +                        case ZPOOL_PROP_HIWATERMARK:
     6960 +                                spa->spa_hiwat = intval;
     6961 +                                break;
     6962 +                        case ZPOOL_PROP_DEDUPMETA_DITTO:
     6963 +                                spa->spa_ddt_meta_copies = intval;
     6964 +                                break;
     6965 +                        case ZPOOL_PROP_META_PLACEMENT:
     6966 +                                mp->spa_enable_meta_placement_selection =
     6967 +                                    intval;
     6968 +                                break;
     6969 +                        case ZPOOL_PROP_SYNC_TO_SPECIAL:
     6970 +                                mp->spa_sync_to_special = intval;
     6971 +                                break;
     6972 +                        case ZPOOL_PROP_DDT_META_TO_METADEV:
     6973 +                                mp->spa_ddt_meta_to_special = intval;
     6974 +                                break;
     6975 +                        case ZPOOL_PROP_ZFS_META_TO_METADEV:
     6976 +                                mp->spa_zfs_meta_to_special = intval;
     6977 +                                break;
     6978 +                        case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
     6979 +                                mp->spa_small_data_to_special = intval;
     6980 +                                break;
     6981 +                        case ZPOOL_PROP_RESILVER_PRIO:
     6982 +                                spa->spa_resilver_prio = intval;
     6983 +                                break;
     6984 +                        case ZPOOL_PROP_SCRUB_PRIO:
     6985 +                                spa->spa_scrub_prio = intval;
     6986 +                                break;
6890 6987                          default:
6891 6988                                  break;
6892 6989                          }
6893 6990                  }
6894 6991  
6895 6992          }
6896 6993  
6897 6994          mutex_exit(&spa->spa_props_lock);
6898 6995  }
6899 6996  
[65 lines elided]
6965 7062                  VERIFY0(zap_add(spa->spa_meta_objset,
6966 7063                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
6967 7064                      sizeof (spa->spa_cksum_salt.zcs_bytes),
6968 7065                      spa->spa_cksum_salt.zcs_bytes, tx));
6969 7066          }
6970 7067  
6971 7068          rrw_exit(&dp->dp_config_rwlock, FTAG);
6972 7069  }
6973 7070  
6974 7071  static void
6975      -vdev_indirect_state_sync_verify(vdev_t *vd)
     7072 +spa_initialize_alloc_trees(spa_t *spa, uint32_t max_queue_depth,
     7073 +    uint64_t queue_depth_total)
6976 7074  {
6977      -        vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
6978      -        vdev_indirect_births_t *vib = vd->vdev_indirect_births;
     7075 +        vdev_t *rvd = spa->spa_root_vdev;
     7076 +        boolean_t dva_throttle_enabled = zio_dva_throttle_enabled;
     7077 +        metaslab_class_t *mcs[2] = {
     7078 +                spa_normal_class(spa),
     7079 +                spa_special_class(spa)
     7080 +        };
     7081 +        size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
6979 7082  
6980      -        if (vd->vdev_ops == &vdev_indirect_ops) {
6981      -                ASSERT(vim != NULL);
6982      -                ASSERT(vib != NULL);
6983      -        }
     7083 +        for (size_t i = 0; i < mcs_len; i++) {
     7084 +                metaslab_class_t *mc = mcs[i];
6984 7085  
6985      -        if (vdev_obsolete_sm_object(vd) != 0) {
6986      -                ASSERT(vd->vdev_obsolete_sm != NULL);
6987      -                ASSERT(vd->vdev_removing ||
6988      -                    vd->vdev_ops == &vdev_indirect_ops);
6989      -                ASSERT(vdev_indirect_mapping_num_entries(vim) > 0);
6990      -                ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0);
     7086 +                ASSERT0(refcount_count(&mc->mc_alloc_slots));
     7087 +                mc->mc_alloc_max_slots = queue_depth_total;
     7088 +                mc->mc_alloc_throttle_enabled = dva_throttle_enabled;
6991 7089  
6992      -                ASSERT3U(vdev_obsolete_sm_object(vd), ==,
6993      -                    space_map_object(vd->vdev_obsolete_sm));
6994      -                ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=,
6995      -                    space_map_allocated(vd->vdev_obsolete_sm));
     7090 +                ASSERT3U(mc->mc_alloc_max_slots, <=,
     7091 +                    max_queue_depth * rvd->vdev_children);
6996 7092          }
6997      -        ASSERT(vd->vdev_obsolete_segments != NULL);
     7093 +}
6998 7094  
6999      -        /*
7000      -         * Since frees / remaps to an indirect vdev can only
7001      -         * happen in syncing context, the obsolete segments
7002      -         * tree must be empty when we start syncing.
7003      -         */
7004      -        ASSERT0(range_tree_space(vd->vdev_obsolete_segments));
     7095 +static void
     7096 +spa_check_alloc_trees(spa_t *spa)
     7097 +{
     7098 +        metaslab_class_t *mcs[2] = {
     7099 +                spa_normal_class(spa),
     7100 +                spa_special_class(spa)
     7101 +        };
     7102 +        size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
     7103 +
     7104 +        for (size_t i = 0; i < mcs_len; i++) {
     7105 +                metaslab_class_t *mc = mcs[i];
     7106 +
     7107 +                mutex_enter(&mc->mc_alloc_lock);
     7108 +                VERIFY0(avl_numnodes(&mc->mc_alloc_tree));
     7109 +                mutex_exit(&mc->mc_alloc_lock);
     7110 +        }
7005 7111  }
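
Both helpers above walk the same two allocation classes.  A hedged
refactoring sketch (hypothetical name) showing the shared iteration, so a
future third class would only need to be added to the array once:

	static void
	spa_foreach_alloc_class(spa_t *spa,
	    void (*cb)(metaslab_class_t *, void *), void *arg)
	{
		metaslab_class_t *mcs[2] = {
			spa_normal_class(spa),
			spa_special_class(spa)
		};

		for (size_t i = 0; i < sizeof (mcs) / sizeof (mcs[0]); i++)
			cb(mcs[i], arg);
	}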
7006 7112  
7007 7113  /*
7008 7114   * Sync the specified transaction group.  New blocks may be dirtied as
7009 7115   * part of the process, so we iterate until it converges.
7010 7116   */
7011 7117  void
7012 7118  spa_sync(spa_t *spa, uint64_t txg)
7013 7119  {
7014 7120          dsl_pool_t *dp = spa->spa_dsl_pool;
[2 lines elided]
7017 7123          vdev_t *rvd = spa->spa_root_vdev;
7018 7124          vdev_t *vd;
7019 7125          dmu_tx_t *tx;
7020 7126          int error;
7021 7127          uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
7022 7128              zfs_vdev_queue_depth_pct / 100;
7023 7129  
7024 7130          VERIFY(spa_writeable(spa));
7025 7131  
7026 7132          /*
7027      -         * Wait for i/os issued in open context that need to complete
7028      -         * before this txg syncs.
7029      -         */
7030      -        VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK]));
7031      -        spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0);
7032      -
7033      -        /*
7034 7133           * Lock out configuration changes.
7035 7134           */
7036 7135          spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7037 7136  
7038 7137          spa->spa_syncing_txg = txg;
7039 7138          spa->spa_sync_pass = 0;
7040 7139  
7041      -        mutex_enter(&spa->spa_alloc_lock);
7042      -        VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7043      -        mutex_exit(&spa->spa_alloc_lock);
     7140 +        spa_check_alloc_trees(spa);
7044 7141  
7045 7142          /*
     7143 +         * Another pool management task might currently hold the
     7144 +         * auto-trim lock while this txg sync runs on its behalf,
     7145 +         * so be prepared to postpone autotrim processing.
     7146 +         */
     7147 +        if (mutex_tryenter(&spa->spa_auto_trim_lock)) {
     7148 +                if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     7149 +                        spa_auto_trim(spa, txg);
     7150 +                mutex_exit(&spa->spa_auto_trim_lock);
     7151 +        }
     7152 +
     7153 +        /*
7046 7154           * If there are any pending vdev state changes, convert them
7047 7155           * into config changes that go out with this transaction group.
7048 7156           */
7049 7157          spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7050 7158          while (list_head(&spa->spa_state_dirty_list) != NULL) {
7051 7159                  /*
7052 7160                   * We need the write lock here because, for aux vdevs,
7053 7161                   * calling vdev_config_dirty() modifies sav_config.
7054 7162                   * This is ugly and will become unnecessary when we
7055 7163                   * eliminate the aux vdev wart by integrating all vdevs
[54 lines elided]
7110 7218  
7111 7219                  /*
7112 7220                   * It is safe to do a lock-free check here because only async
7113 7221                   * allocations look at mg_max_alloc_queue_depth, and async
7114 7222                   * allocations all happen from spa_sync().
7115 7223                   */
7116 7224                  ASSERT0(refcount_count(&mg->mg_alloc_queue_depth));
7117 7225                  mg->mg_max_alloc_queue_depth = max_queue_depth;
7118 7226                  queue_depth_total += mg->mg_max_alloc_queue_depth;
7119 7227          }
7120      -        metaslab_class_t *mc = spa_normal_class(spa);
7121      -        ASSERT0(refcount_count(&mc->mc_alloc_slots));
7122      -        mc->mc_alloc_max_slots = queue_depth_total;
7123      -        mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
7124 7228  
7125      -        ASSERT3U(mc->mc_alloc_max_slots, <=,
7126      -            max_queue_depth * rvd->vdev_children);
     7229 +        spa_initialize_alloc_trees(spa, max_queue_depth,
     7230 +            queue_depth_total);
7127 7231  
7128      -        for (int c = 0; c < rvd->vdev_children; c++) {
7129      -                vdev_t *vd = rvd->vdev_child[c];
7130      -                vdev_indirect_state_sync_verify(vd);
7131      -
7132      -                if (vdev_indirect_should_condense(vd)) {
7133      -                        spa_condense_indirect_start_sync(vd, tx);
7134      -                        break;
7135      -                }
7136      -        }
7137      -
7138 7232          /*
7139 7233           * Iterate to convergence.
7140 7234           */
     7235 +
     7236 +        zfs_autosnap_t *autosnap = spa_get_autosnap(dp->dp_spa);
     7237 +        mutex_enter(&autosnap->autosnap_lock);
     7238 +
     7239 +        autosnap_zone_t *zone = list_head(&autosnap->autosnap_zones);
     7240 +        while (zone != NULL) {
     7241 +                zone->created = B_FALSE;
     7242 +                zone->dirty = B_FALSE;
     7243 +                zone = list_next(&autosnap->autosnap_zones, zone);
     7244 +        }
     7245 +
     7246 +        mutex_exit(&autosnap->autosnap_lock);
     7247 +
7141 7248          do {
7142 7249                  int pass = ++spa->spa_sync_pass;
7143 7250  
7144 7251                  spa_sync_config_object(spa, tx);
7145 7252                  spa_sync_aux_dev(spa, &spa->spa_spares, tx,
7146 7253                      ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
7147 7254                  spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
7148 7255                      ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
7149 7256                  spa_errlog_sync(spa, txg);
7150 7257                  dsl_pool_sync(dp, txg);
[6 lines elided]
7157 7264                           * we sync the deferred frees later in pass 1.
7158 7265                           */
7159 7266                          ASSERT3U(pass, >, 1);
7160 7267                          bplist_iterate(free_bpl, bpobj_enqueue_cb,
7161 7268                              &spa->spa_deferred_bpobj, tx);
7162 7269                  }
7163 7270  
7164 7271                  ddt_sync(spa, txg);
7165 7272                  dsl_scan_sync(dp, tx);
7166 7273  
7167      -                if (spa->spa_vdev_removal != NULL)
7168      -                        svr_sync(spa, tx);
7169      -
7170      -                while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7171      -                    != NULL)
     7274 +                while (vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7172 7275                          vdev_sync(vd, txg);
7173 7276  
7174 7277                  if (pass == 1) {
7175 7278                          spa_sync_upgrades(spa, tx);
7176 7279                          ASSERT3U(txg, >=,
7177 7280                              spa->spa_uberblock.ub_rootbp.blk_birth);
7178 7281                          /*
7179 7282                           * Note: We need to check if the MOS is dirty
7180 7283                           * because we could have marked the MOS dirty
7181 7284                           * without updating the uberblock (e.g. if we
[31 lines elided]
7213 7316                   * outstanding AVZ operations that weren't completed in
7214 7317                   * spa_sync_config_object.
7215 7318                   */
7216 7319                  uint64_t all_vdev_zap_entry_count;
7217 7320                  ASSERT0(zap_count(spa->spa_meta_objset,
7218 7321                      spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
7219 7322                  ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
7220 7323                      all_vdev_zap_entry_count);
7221 7324          }
7222 7325  
7223      -        if (spa->spa_vdev_removal != NULL) {
7224      -                ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]);
7225      -        }
7226      -
7227 7326          /*
7228 7327           * Rewrite the vdev configuration (which includes the uberblock)
7229 7328           * to commit the transaction group.
7230 7329           *
7231 7330           * If there are no dirty vdevs, we sync the uberblock to a few
7232 7331           * random top-level vdevs that are known to be visible in the
7233 7332           * config cache (see spa_vdev_add() for a complete description).
7234 7333           * If there *are* dirty vdevs, sync the uberblock to all vdevs.
7235 7334           */
7236 7335          for (;;) {
7237 7336                  /*
7238 7337                   * We hold SCL_STATE to prevent vdev open/close/etc.
7239 7338                   * while we're attempting to write the vdev labels.
7240 7339                   */
7241 7340                  spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7242 7341  
7243 7342                  if (list_is_empty(&spa->spa_config_dirty_list)) {
7244      -                        vdev_t *svd[SPA_SYNC_MIN_VDEVS];
     7343 +                        vdev_t *svd[SPA_DVAS_PER_BP];
7245 7344                          int svdcount = 0;
7246 7345                          int children = rvd->vdev_children;
7247 7346                          int c0 = spa_get_random(children);
7248 7347  
7249 7348                          for (int c = 0; c < children; c++) {
7250 7349                                  vd = rvd->vdev_child[(c0 + c) % children];
7251      -                                if (vd->vdev_ms_array == 0 || vd->vdev_islog ||
7252      -                                    !vdev_is_concrete(vd))
     7350 +                                if (vd->vdev_ms_array == 0 || vd->vdev_islog)
7253 7351                                          continue;
7254 7352                                  svd[svdcount++] = vd;
7255      -                                if (svdcount == SPA_SYNC_MIN_VDEVS)
     7353 +                                if (svdcount == SPA_DVAS_PER_BP)
7256 7354                                          break;
7257 7355                          }
7258 7356                          error = vdev_config_sync(svd, svdcount, txg);
7259 7357                  } else {
7260 7358                          error = vdev_config_sync(rvd->vdev_child,
7261 7359                              rvd->vdev_children, txg);
7262 7360                  }
7263 7361  
7264 7362                  if (error == 0)
7265 7363                          spa->spa_last_synced_guid = rvd->vdev_guid;
[20 lines elided]
7286 7384           * let it become visible to the config cache.
7287 7385           */
7288 7386          if (spa->spa_config_syncing != NULL) {
7289 7387                  spa_config_set(spa, spa->spa_config_syncing);
7290 7388                  spa->spa_config_txg = txg;
7291 7389                  spa->spa_config_syncing = NULL;
7292 7390          }
7293 7391  
7294 7392          dsl_pool_sync_done(dp, txg);
7295 7393  
7296      -        mutex_enter(&spa->spa_alloc_lock);
7297      -        VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7298      -        mutex_exit(&spa->spa_alloc_lock);
     7394 +        spa_check_alloc_trees(spa);
7299 7395  
7300 7396          /*
7301 7397           * Update usable space statistics.
7302 7398           */
7303 7399          while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
7304 7400                  vdev_sync_done(vd, txg);
7305 7401  
7306 7402          spa_update_dspace(spa);
7307      -
     7403 +        spa_update_latency(spa);
7308 7404          /*
7309 7405           * It had better be the case that we didn't dirty anything
7310 7406           * since vdev_config_sync().
7311 7407           */
7312 7408          ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7313 7409          ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7314 7410          ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
7315 7411  
7316 7412          spa->spa_sync_pass = 0;
7317 7413  
     7414 +        spa_check_special(spa);
     7415 +
7318 7416          /*
7319 7417           * Update the last synced uberblock here. We want to do this at
7320 7418           * the end of spa_sync() so that consumers of spa_last_synced_txg()
7321 7419           * will be guaranteed that all the processing associated with
7322 7420           * that txg has been completed.
7323 7421           */
7324 7422          spa->spa_ubsync = spa->spa_uberblock;
7325 7423          spa_config_exit(spa, SCL_CONFIG, FTAG);
7326 7424  
7327 7425          spa_handle_ignored_writes(spa);
[52 lines elided]
7380 7478                   * a device that's been replaced, which requires grabbing
7381 7479                   * spa_namespace_lock, so we must drop it here.
7382 7480                   */
7383 7481                  spa_open_ref(spa, FTAG);
7384 7482                  mutex_exit(&spa_namespace_lock);
7385 7483                  spa_async_suspend(spa);
7386 7484                  mutex_enter(&spa_namespace_lock);
7387 7485                  spa_close(spa, FTAG);
7388 7486  
7389 7487                  if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
     7488 +                        wbc_deactivate(spa);
     7489 +
7390 7490                          spa_unload(spa);
7391 7491                          spa_deactivate(spa);
7392 7492                  }
     7493 +
7393 7494                  spa_remove(spa);
7394 7495          }
7395 7496          mutex_exit(&spa_namespace_lock);
7396 7497  }
7397 7498  
7398 7499  vdev_t *
7399 7500  spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
7400 7501  {
7401 7502          vdev_t *vd;
7402 7503          int i;
[75 lines elided]
7478 7579          for (i = 0; i < sav->sav_count; i++) {
7479 7580                  if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
7480 7581                      &refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
7481 7582                      refcnt > 2)
7482 7583                          return (B_TRUE);
7483 7584          }
7484 7585  
7485 7586          return (B_FALSE);
7486 7587  }
7487 7588  
7488      -sysevent_t *
     7589 +/*
     7590 + * Post a sysevent corresponding to the given event.  The 'name' must be one of
     7591 + * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
     7592 + * filled in from the spa and (optionally) the vdev.  This doesn't do anything
     7593 + * in the userland libzpool, as we don't want consumers to misinterpret ztest
     7594 + * or zdb as real changes.
     7595 + */
     7596 +static sysevent_t *
7489 7597  spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7490 7598  {
7491 7599          sysevent_t              *ev = NULL;
7492 7600  #ifdef _KERNEL
7493 7601          sysevent_attr_list_t    *attr = NULL;
7494 7602          sysevent_value_t        value;
7495 7603  
7496 7604          ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
7497 7605              SE_SLEEP);
7498 7606          ASSERT(ev != NULL);
↓ open down ↓ 1 lines elided ↑ open up ↑
7500 7608          value.value_type = SE_DATA_TYPE_STRING;
7501 7609          value.value.sv_string = spa_name(spa);
7502 7610          if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
7503 7611                  goto done;
7504 7612  
7505 7613          value.value_type = SE_DATA_TYPE_UINT64;
7506 7614          value.value.sv_uint64 = spa_guid(spa);
7507 7615          if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
7508 7616                  goto done;
7509 7617  
7510      -        if (vd) {
     7618 +        if (vd != NULL) {
7511 7619                  value.value_type = SE_DATA_TYPE_UINT64;
7512 7620                  value.value.sv_uint64 = vd->vdev_guid;
7513 7621                  if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
7514 7622                      SE_SLEEP) != 0)
7515 7623                          goto done;
7516 7624  
7517 7625                  if (vd->vdev_path) {
7518 7626                          value.value_type = SE_DATA_TYPE_STRING;
7519 7627                          value.value.sv_string = vd->vdev_path;
7520 7628                          if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
[11 lines elided]
7532 7640          attr = NULL;
7533 7641  
7534 7642  done:
7535 7643          if (attr)
7536 7644                  sysevent_free_attr(attr);
7537 7645  
7538 7646  #endif
7539 7647          return (ev);
7540 7648  }
7541 7649  
7542      -void
7543      -spa_event_post(sysevent_t *ev)
     7650 +static void
     7651 +spa_event_post(void *arg)
7544 7652  {
7545 7653  #ifdef _KERNEL
     7654 +        sysevent_t *ev = (sysevent_t *)arg;
     7655 +
7546 7656          sysevent_id_t           eid;
7547 7657  
7548 7658          (void) log_sysevent(ev, SE_SLEEP, &eid);
7549 7659          sysevent_free(ev);
7550 7660  #endif
7551 7661  }
7552 7662  
     7663 +/*
     7664 + * Dispatch event notifications to the taskq such that the corresponding
     7665 + * sysevents are queued with no spa locks held
     7666 + */
     7667 +taskq_t *spa_sysevent_taskq;
     7668 +
     7669 +static void
     7670 +spa_event_notify_impl(sysevent_t *ev)
     7671 +{
     7672 +        if (taskq_dispatch(spa_sysevent_taskq, spa_event_post,
     7673 +            ev, TQ_NOSLEEP) == NULL) {
     7674 +                /*
     7675 +                 * These are management sysevents; as much as it is
     7676 +                 * unpleasant to drop these due to syseventd not being able
     7677 +                 * to keep up, perhaps due to resource shortages, we are not
     7678 +                 * going to sleep here and risk locking up the pool sync
     7679 +                 * process; notify admin of problems
     7680 +                 */
     7681 +                cmn_err(CE_NOTE, "Could not dispatch sysevent notification "
     7682 +                    "for %s, please check state of syseventd\n",
     7683 +                    sysevent_get_subclass_name(ev));
     7684 +
     7685 +                sysevent_free(ev);
     7686 +
     7687 +                return;
     7688 +        }
     7689 +}
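
spa_sysevent_taskq must exist before the first notification is dispatched.
The creation site is outside this hunk; a sketch of a plausible init-time
setup (the function name and taskq parameters here are assumptions):

	void
	spa_sysevent_taskq_init(void)
	{
		/* one thread keeps event ordering; spa_sync() never blocks */
		spa_sysevent_taskq = taskq_create("spa_sysevent_taskq", 1,
		    minclsyspri, 1, INT_MAX, TASKQ_PREPOPULATE);
	}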
     7690 +
7553 7691  void
7554      -spa_event_discard(sysevent_t *ev)
     7692 +spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7555 7693  {
7556      -#ifdef _KERNEL
7557      -        sysevent_free(ev);
7558      -#endif
     7694 +        spa_event_notify_impl(spa_event_create(spa, vd, hist_nvl, name));
7559 7695  }
7560 7696  
7561 7697  /*
7562      - * Post a sysevent corresponding to the given event.  The 'name' must be one of
7563      - * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
7564      - * filled in from the spa and (optionally) the vdev and history nvl.  This
7565      - * doesn't do anything in the userland libzpool, as we don't want consumers to
7566      - * misinterpret ztest or zdb as real changes.
     7698 + * Dispatches auto-trim processing to all top-level vdevs. This is
     7699 + * called from spa_sync() once every txg.
7567 7700   */
     7701 +static void
     7702 +spa_auto_trim(spa_t *spa, uint64_t txg)
     7703 +{
     7704 +        ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER) == SCL_CONFIG);
     7705 +        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
     7706 +        ASSERT(spa->spa_auto_trim_taskq != NULL);
     7707 +
     7708 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7709 +                vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
     7710 +                vti->vti_vdev = spa->spa_root_vdev->vdev_child[i];
     7711 +                vti->vti_txg = txg;
     7712 +                vti->vti_done_cb = (void (*)(void *))spa_vdev_auto_trim_done;
     7713 +                vti->vti_done_arg = spa;
     7714 +                (void) taskq_dispatch(spa->spa_auto_trim_taskq,
     7715 +                    (void (*)(void *))vdev_auto_trim, vti, TQ_SLEEP);
     7716 +                spa->spa_num_auto_trimming++;
     7717 +        }
     7718 +}
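
The completion callback wired into vti_done_cb above is
spa_vdev_auto_trim_done(), which is not part of this hunk.  A sketch of what
it presumably must do with the spa_num_auto_trimming counter (an assumption,
not the actual implementation):

	static void
	spa_vdev_auto_trim_done_sketch(spa_t *spa)
	{
		mutex_enter(&spa->spa_auto_trim_lock);
		ASSERT(spa->spa_num_auto_trimming > 0);
		spa->spa_num_auto_trimming--;
		/* waiters (e.g. taskq teardown) are signalled elsewhere */
		mutex_exit(&spa->spa_auto_trim_lock);
	}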
     7719 +
     7720 +/*
     7721 + * Performs the sync update of the MOS pool directory's trim start/stop values.
     7722 + */
     7723 +static void
     7724 +spa_trim_update_time_sync(void *arg, dmu_tx_t *tx)
     7725 +{
     7726 +        spa_t *spa = arg;
     7727 +        VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     7728 +            DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
     7729 +            &spa->spa_man_trim_start_time, tx));
     7730 +        VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     7731 +            DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
     7732 +            &spa->spa_man_trim_stop_time, tx));
     7733 +}
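/*
 * A hedged sketch of the matching load path (assumed, not part of this
 * hunk): at pool open the persisted times would be read back from the same
 * MOS pool-directory entries; missing entries would simply leave the
 * in-core values at zero:
 */
static void
spa_load_trim_times(spa_t *spa)
{
        (void) zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
            &spa->spa_man_trim_start_time);
        (void) zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
            &spa->spa_man_trim_stop_time);
}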
     7734 +
     7735 +/*
     7736 + * Updates the in-core and on-disk manual TRIM operation start/stop time.
     7737 + * Passing UINT64_MAX for either start_time or stop_time means that no
     7738 + * update to that value should be recorded.
     7739 + */
     7740 +static dmu_tx_t *
     7741 +spa_trim_update_time(spa_t *spa, uint64_t start_time, uint64_t stop_time)
     7742 +{
     7743 +        int err;
     7744 +        dmu_tx_t *tx;
     7745 +
     7746 +        ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
     7747 +        if (start_time != UINT64_MAX)
     7748 +                spa->spa_man_trim_start_time = start_time;
     7749 +        if (stop_time != UINT64_MAX)
     7750 +                spa->spa_man_trim_stop_time = stop_time;
     7751 +        tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
     7752 +        err = dmu_tx_assign(tx, TXG_WAIT);
     7753 +        if (err) {
     7754 +                dmu_tx_abort(tx);
     7755 +                return (NULL);
     7756 +        }
     7757 +        dsl_sync_task_nowait(spa_get_dsl(spa), spa_trim_update_time_sync,
     7758 +            spa, 1, ZFS_SPACE_CHECK_RESERVED, tx);
     7759 +
     7760 +        return (tx);
     7761 +}
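/*
 * Note the calling convention: the transaction is assigned here, under
 * spa_man_trim_lock, but must be committed by the caller only after
 * dropping that lock, because dmu_tx_commit() can wait on syncing context.
 * Usage as in spa_man_trim() below:
 *
 *      mutex_enter(&spa->spa_man_trim_lock);
 *      tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
 *      mutex_exit(&spa->spa_man_trim_lock);
 *      if (tx != NULL)
 *              dmu_tx_commit(tx);
 */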
     7762 +
     7763 +/*
     7764 + * Initiates a manual TRIM of the whole pool. This kicks off an individual
     7765 + * TRIM task for each top-level vdev, which then passes over all of the
     7766 + * free space in the vdev's metaslabs and issues TRIM commands for that
     7767 + * space to the underlying devices.
     7768 + */
     7769 +extern void
     7770 +spa_man_trim(spa_t *spa, uint64_t rate)
     7771 +{
     7772 +        dmu_tx_t *time_update_tx;
     7773 +
     7774 +        mutex_enter(&spa->spa_man_trim_lock);
     7775 +
     7776 +        if (rate != 0)
     7777 +                spa->spa_man_trim_rate = MAX(rate, spa_min_trim_rate(spa));
     7778 +        else
     7779 +                spa->spa_man_trim_rate = 0;
     7780 +
     7781 +        if (spa->spa_num_man_trimming) {
     7782 +                /*
     7783 +                 * TRIM is already ongoing. Wake up all sleeping vdev trim
     7784 +                 * threads because the trim rate might have changed above.
     7785 +                 */
     7786 +                cv_broadcast(&spa->spa_man_trim_update_cv);
     7787 +                mutex_exit(&spa->spa_man_trim_lock);
     7788 +                return;
     7789 +        }
     7790 +        spa_man_trim_taskq_create(spa);
     7791 +        spa->spa_man_trim_stop = B_FALSE;
     7792 +
     7793 +        spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);
     7794 +        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
     7795 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7796 +                vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
     7797 +                vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
     7798 +                vti->vti_vdev = vd;
     7799 +                vti->vti_done_cb = (void (*)(void *))spa_vdev_man_trim_done;
     7800 +                vti->vti_done_arg = spa;
     7801 +                spa->spa_num_man_trimming++;
     7802 +
     7803 +                vd->vdev_trim_prog = 0;
     7804 +                (void) taskq_dispatch(spa->spa_man_trim_taskq,
     7805 +                    (void (*)(void *))vdev_man_trim, vti, TQ_SLEEP);
     7806 +        }
     7807 +        spa_config_exit(spa, SCL_CONFIG, FTAG);
     7808 +        time_update_tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
     7809 +        mutex_exit(&spa->spa_man_trim_lock);
     7810 +        /* must not hold spa_man_trim_lock, to avoid deadlock w/ syncing ctx */
     7811 +        if (time_update_tx != NULL)
     7812 +                dmu_tx_commit(time_update_tx);
     7813 +}
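/*
 * A hedged sketch (hypothetical helper) of how a vdev trim thread is
 * assumed to honor the rate and the update broadcast above: after each
 * step it sleeps on spa_man_trim_update_cv, so a rate change or stop
 * request wakes it immediately:
 */
static void
vdev_man_trim_throttle(spa_t *spa, uint64_t bytes_trimmed)
{
        mutex_enter(&spa->spa_man_trim_lock);
        if (spa->spa_man_trim_rate != 0 && !spa->spa_man_trim_stop) {
                /* pace so that bytes_trimmed / elapsed <= trim rate */
                clock_t timeout = ddi_get_lbolt() +
                    drv_usectohz((bytes_trimmed * MICROSEC) /
                    spa->spa_man_trim_rate);
                (void) cv_timedwait(&spa->spa_man_trim_update_cv,
                    &spa->spa_man_trim_lock, timeout);
        }
        mutex_exit(&spa->spa_man_trim_lock);
}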
     7814 +
     7815 +/*
     7816 + * Orders a manual TRIM operation to stop and returns immediately.
     7817 + */
     7818 +extern void
     7819 +spa_man_trim_stop(spa_t *spa)
     7820 +{
     7821 +        boolean_t held = MUTEX_HELD(&spa->spa_man_trim_lock);
     7822 +        if (!held)
     7823 +                mutex_enter(&spa->spa_man_trim_lock);
     7824 +        spa->spa_man_trim_stop = B_TRUE;
     7825 +        cv_broadcast(&spa->spa_man_trim_update_cv);
     7826 +        if (!held)
     7827 +                mutex_exit(&spa->spa_man_trim_lock);
     7828 +}
     7829 +
     7830 +/*
     7831 + * Orders a manual TRIM operation to stop and waits for both manual and
     7832 + * automatic TRIM to complete. By holding both the spa_man_trim_lock and
     7833 + * the spa_auto_trim_lock, the caller can guarantee that after this
     7834 + * function returns, no new TRIM operations can be initiated in parallel.
     7835 + */
7568 7836  void
7569      -spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
     7837 +spa_trim_stop_wait(spa_t *spa)
7570 7838  {
7571      -        spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
     7839 +        ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
     7840 +        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
     7841 +        spa->spa_man_trim_stop = B_TRUE;
     7842 +        cv_broadcast(&spa->spa_man_trim_update_cv);
     7843 +        while (spa->spa_num_man_trimming > 0)
     7844 +                cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
     7845 +        while (spa->spa_num_auto_trimming > 0)
     7846 +                cv_wait(&spa->spa_auto_trim_done_cv, &spa->spa_auto_trim_lock);
     7847 +}
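/*
 * Hypothetical caller pattern for the above (e.g. before tearing down the
 * pool's vdevs), taking both locks so that no new TRIM can start while the
 * in-flight operations drain:
 *
 *      mutex_enter(&spa->spa_man_trim_lock);
 *      mutex_enter(&spa->spa_auto_trim_lock);
 *      spa_trim_stop_wait(spa);
 *      mutex_exit(&spa->spa_auto_trim_lock);
 *      mutex_exit(&spa->spa_man_trim_lock);
 */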
     7848 +
     7849 +/*
     7850 + * Returns manual TRIM progress via four output arguments:
     7851 + * 1) prog: the total number of bytes that manual TRIM has already
     7852 + *      passed over on the pool (regardless of whether the space is
     7853 + *      allocated). Completion is indicated when the returned value is
     7854 + *      either zero or equal to the sum of the sizes of all top-level
     7855 + *      vdevs.
     7856 + * 2) rate: the TRIM rate in bytes per second. A value of zero
     7857 + *      indicates that TRIM proceeds as fast as possible.
     7858 + * 3) start_time: the UNIXTIME at which the last manual TRIM operation
     7859 + *      was started. If no manual TRIM was ever initiated on the pool,
     7860 + *      this is zero.
     7861 + * 4) stop_time: the UNIXTIME at which the last manual TRIM operation
     7862 + *      stopped on the pool. If a TRIM was started (start_time != 0)
     7863 + *      but has not yet completed, stop_time is zero. If a TRIM is not
     7864 + *      currently ongoing, start_time is non-zero, and stop_time is
     7865 + *      zero, the previously initiated TRIM operation was interrupted.
     7866 + */
     7867 +extern void
     7868 +spa_get_trim_prog(spa_t *spa, uint64_t *prog, uint64_t *rate,
     7869 +    uint64_t *start_time, uint64_t *stop_time)
     7870 +{
     7871 +        uint64_t total = 0;
     7872 +        vdev_t *root_vd = spa->spa_root_vdev;
     7873 +
     7874 +        ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
     7875 +        mutex_enter(&spa->spa_man_trim_lock);
     7876 +        if (spa->spa_num_man_trimming > 0) {
     7877 +                for (uint64_t i = 0; i < root_vd->vdev_children; i++) {
     7878 +                        total += root_vd->vdev_child[i]->vdev_trim_prog;
     7879 +                }
     7880 +        }
     7881 +        *prog = total;
     7882 +        *rate = spa->spa_man_trim_rate;
     7883 +        *start_time = spa->spa_man_trim_start_time;
     7884 +        *stop_time = spa->spa_man_trim_stop_time;
     7885 +        mutex_exit(&spa->spa_man_trim_lock);
     7886 +}
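/*
 * A hedged consumer-side sketch: prog can be turned into a percentage by
 * dividing by the summed top-level vdev sizes (vdev_asize is assumed to be
 * the appropriate size here):
 */
static uint64_t
spa_trim_pct_done(spa_t *spa)
{
        uint64_t prog, rate, start, stop, total_sz = 0;

        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
        spa_get_trim_prog(spa, &prog, &rate, &start, &stop);
        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++)
                total_sz += spa->spa_root_vdev->vdev_child[i]->vdev_asize;
        spa_config_exit(spa, SCL_CONFIG, FTAG);

        return (total_sz == 0 ? 0 : (prog * 100) / total_sz);
}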
     7887 +
     7888 +/*
     7889 + * Callback invoked when vdev_man_trim has finished on a top-level vdev.
     7890 + */
     7891 +static void
     7892 +spa_vdev_man_trim_done(spa_t *spa)
     7893 +{
     7894 +        dmu_tx_t *time_update_tx = NULL;
     7895 +
     7896 +        mutex_enter(&spa->spa_man_trim_lock);
     7897 +        ASSERT(spa->spa_num_man_trimming > 0);
     7898 +        spa->spa_num_man_trimming--;
     7899 +        if (spa->spa_num_man_trimming == 0) {
     7900 +                /* if we were interrupted, leave stop_time at zero */
     7901 +                if (!spa->spa_man_trim_stop)
     7902 +                        time_update_tx = spa_trim_update_time(spa, UINT64_MAX,
     7903 +                            gethrestime_sec());
     7904 +                spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_FINISH);
     7905 +                spa_async_request(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);
     7906 +                cv_broadcast(&spa->spa_man_trim_done_cv);
     7907 +        }
     7908 +        mutex_exit(&spa->spa_man_trim_lock);
     7909 +
     7910 +        if (time_update_tx != NULL)
     7911 +                dmu_tx_commit(time_update_tx);
     7912 +}
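/*
 * Note: the man-trim taskq is torn down via the async request above rather
 * than destroyed directly, because this callback runs on one of
 * spa_man_trim_taskq's own threads, and taskq_destroy() waits for all of a
 * taskq's tasks to finish; calling it from here would deadlock.
 */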
     7913 +
     7914 +/*
     7915 + * Called from vdev_auto_trim when a vdev has completed its auto-trim
     7916 + * processing.
     7917 + */
     7918 +static void
     7919 +spa_vdev_auto_trim_done(spa_t *spa)
     7920 +{
     7921 +        mutex_enter(&spa->spa_auto_trim_lock);
     7922 +        ASSERT(spa->spa_num_auto_trimming > 0);
     7923 +        spa->spa_num_auto_trimming--;
     7924 +        if (spa->spa_num_auto_trimming == 0)
     7925 +                cv_broadcast(&spa->spa_auto_trim_done_cv);
     7926 +        mutex_exit(&spa->spa_auto_trim_lock);
     7927 +}
     7928 +
     7929 +/*
     7930 + * Determines and returns the minimum sensible rate at which a manual
     7931 + * TRIM can be performed on a given spa. Since we perform TRIM in
     7932 + * metaslab-sized increments, we simply cap the longest step between
     7933 + * metaslab TRIMs at 100s (a somewhat arbitrary figure), so on a
     7934 + * typical 200-metaslab vdev the longest a TRIM should take is about
     7935 + * 5.5 hours. It *can* take longer if the device is slow to respond
     7936 + * to zio_trim() commands, if a vdev contains more than 200 metaslabs,
     7937 + * or if metaslab sizes vary widely between top-level vdevs.
     7938 + */
     7939 +static uint64_t
     7940 +spa_min_trim_rate(spa_t *spa)
     7941 +{
     7942 +        uint64_t smallest_ms_sz = UINT64_MAX;
     7943 +
     7944 +        /* find the smallest metaslab */
     7945 +        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
     7946 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7947 +                smallest_ms_sz = MIN(smallest_ms_sz,
     7948 +                    spa->spa_root_vdev->vdev_child[i]->vdev_ms[0]->ms_size);
     7949 +        }
     7950 +        spa_config_exit(spa, SCL_CONFIG, FTAG);
     7951 +        VERIFY(smallest_ms_sz != 0);
     7952 +
     7953 +        /* minimum TRIM rate is 1/100th of the smallest metaslab size */
     7954 +        return (smallest_ms_sz / 100);
7572 7955  }
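/*
 * Worked example (illustrative numbers): with a smallest metaslab of
 * 8 GiB, the floor is 8 GiB / 100 ~= 82 MiB/s, i.e. at the minimum rate
 * the smallest metaslab step takes the full 100s; a 200-metaslab vdev
 * then needs about 200 x 100s = 20,000s ~= 5.5 hours, as noted above.
 */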
    